Journal of Applied Psychology 


Edited by Donald G. Paterson, University of Minnesota 


Consulting Editors 


George K. Bennett, Psychological Corporation 
Walter V. Bingham, Washington, D. C. 
Harold E. Burtt, Ohio State University 
Allen L. Edwards, University of Washington 
Clifford E. Jurgensen, Minneapolis Gas Co. 
Irving Lorge, T. C. Columbia University 
Quinn McNemar, Stanford University 


Alexander Mintz, City College of New York 
James P. Porter, Danville, Illinois
Julian B. Rotter, Ohio State University 
Edward K. Strong, Jr., Stanford University 
Donald E. Super, T. C. Columbia University 
Morris S. Viteles, University of Pennsylvania 
Alfred C. Welch, Knox-Reeves, Minneapolis 





Table of Contents 


Nineteen-Year Followup of Engineer Interests: E. K. Strong, Jr.
Academic Achievement and Strong Occupational Level Scores: J. W. Gustad
Interest Item Response Arrangement as it Affects Discrimination between Professional Groups: J. V. Zuckerman
Communication, Supervision, and Morale: C. G. Browne and B. J. Neitzel
Opinions on Communism of Air Force Police Trainees: N. E. Green
Studies in Job Evaluation: 9. Validity of a Check List for Evaluating Office Jobs: M. C.
Specificity of Over- and Under-Achievement in College Courses: W. C. Krathwohl
The Role of Tests in the Medical Selection Program: R. B. Ralph and C. W. Taylor
Faking Personality Test Scores in a Simulated Employment Situation: A. G. Wesman
The Relationship between Ortho-Rater Tests of Acuity and Color Vision in a Senescent Group: R. W. Kleemeier
Note on Table for Use with Spearman-Brown Formula: L. W. Cozan
The Scaling of Stimuli by the Method of Successive Intervals: A. L. Edwards
Paired Comparison Ratings: 1. The Effect on Ratings of Reductions in the Number of Pairs: E. J. McCormick and J. A. Bachus
Dial Reading Performance as a Function of Brightness: S. D. S. Spragg and M. L. Rock
Critique of Rock’s “A Sales Situation Test”: J. Bernard
Answer to Bernard’s Critique of Rock’s “A Sales Situation Test”: M. L. Rock
Editor’s Reply to Bernard’s Criticism: D. G. Paterson
Special Review: Intelligence and Cultural Differences: J. G. Darley


Book Reviews 
New Books, Monographs, and Pamphlets 





American Psychological Association 


Vol. 36, No. 2 


April, 1952 





Journal of Applied Psychology 


Published Bi-monthly by the American Psychological Association, Inc. 
Prince and Lemon Sts., Lancaster, Pa. 


Annual subscription, $6.00; single copies, $1.25 


Subscriptions and business communications should be sent to 
American Psychological Association 
1515 Massachusetts Avenue N.W. 
Washington 5, D. C. 


Articles for publication and books for review should be sent to the Editor 


Professor Donald G. Paterson, Department of Psychology 
University of Minnesota, Minneapolis 14, Minnesota 





This Journal gives prompt consideration to manuscripts reporting original investigations in any field of applied psychology except clinical and consulting psychology. A descriptive or theoretical article is occasionally accepted if it deals in a distinctive manner with a problem of applied psychology. The policy is, however, to favor papers dealing with quantitative investigations of direct value to psychologists working in the following fields: Vocational diagnosis and occupational guidance; educational diagnosis, prediction and guidance at the secondary school level and higher; personnel selection, training, placement, transfer and promotion in business, industry and government service including the armed forces; supervisory training in business, industry and government; bio-mechanics or design of machines to fit the human operator; illumination, ventilation and fatigue in industry; job analysis, description, classification and evaluation; measurement of morale of executives, supervisors, or employees; surveys of opinion on social or political issues, such as those conducted by The Psychological Corporation; psychological problems in market research and in advertising.


Articles may be under 500 words. The maximum is 12,000 words, the average in the neighborhood of 4,000 words. To reduce lag of publication, adherence to the rule of “brevity consistent with clarity” is encouraged.


A lapse of six to twelve months occurs between 
acceptance of an article and its publication, the 
lag varying with the rate at which manuscripts 
are submitted. If, however, an author is pre- 
pared to defray the costs of printing the neces- 
sary extra pages, he may arrange for earlier 
publication without thereby postponing the ap- 
pearance of manuscripts by other contributors. 
This enables the management to provide space in 
addition to the scheduled 64 pages per issue. 
“Early publication” is thus a direct contribution 
to the subscribers. By cutting down lag in pub- 
lication, it also benefits those authors whose 
articles are published in regular turn. 


Tables, footnotes and references as well as 
text of manuscripts should be typed double-spaced 
throughout. Authors should adhere to the con- 
ventions described by J. E. Anderson and W. 
L. Valentine in “The preparation of articles for 
publication in the journals of the American 
Psychological Association,” Psychol. Bull., 1944, 
41, 345-376. A reprint of this article will be 
loaned to any prospective contributor who does
not find it in his library. 


Entered as second-class matter, August 19, 1943, at the post office at Lancaster, Pa., under the act of March 3, 1879 


Acceptance for mailing at the special rate of postage provided for in paragraph (d-2), Section 34.40,
P. L. & R. of 1948, authorized October 10, 194















Nineteen-Year Followup of Engineer Interests 


Edward K. Strong, Jr. 


Stanford University 


Knowing only a college freshman’s score in 
engineer interest, can one predict with some 
degree of certainty: (a) his college major; 
(b) his occupational choice while a freshman or 
sophomore; and (c) the occupation he will be 
engaged in 19 years later? Actually one can- 
not predict the specific college major, occupa- 
tional choice, or occupational career but one 
can predict surprisingly well whether the 
occupation will be engineering, some occupa- 
tion closely related to engineering or, at the 
other extreme, some occupation quite unrelated 
to engineering. 

The data in this investigation are based on 
the Vocational Interest Blanks of 306 Stanford 
University freshmen of 1930, a goodly propor- 
tion of whom also filled out the Blank in 1931, 
1939, and 1949. On each occasion extensive information was obtained regarding their education, their vocational choice, and the positions they had held, together with a varying amount of reaction to their past and present activities.

Consider first how permanent or persistent 
are the interests measured by the Vocational 
Interest Test, and second, how well measured 
interests predict choice of occupation. 


Reliability of Interest Scores 

The popular notion is that interests change 
so often and so unpredictably that no forecast 
can be made on such a basis. The facts are 
otherwise as is demonstrated below. 

Using the odd-even technique, the reliability 
of the engineer interest scale of the Vocational 
Interest Test is .936 (3, p. 77, 4). Burnham 
(1) reported a coefficient of .95, using the 
test-retest technique for one week and Glass 
(2) reported .92 for one month. 
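
The odd-even technique mentioned above can be sketched in code. The following is a minimal illustration, not the procedure documented in the sources cited: it assumes that odd- and even-numbered items are summed into two half-scores, that the halves are correlated, and that the result is stepped up with the Spearman-Brown formula. The function names and the toy item scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half, factor=2.0):
    """Step a part-test correlation up to a test `factor` times as long."""
    return factor * r_half / (1 + (factor - 1) * r_half)


def odd_even_reliability(item_scores):
    """item_scores: one list of item scores per person (assumed layout)."""
    odd_half = [sum(person[0::2]) for person in item_scores]
    even_half = [sum(person[1::2]) for person in item_scores]
    return spearman_brown(pearson_r(odd_half, even_half))


# Hypothetical toy data: three persons, four items each.
toy = [[1, 1, 2, 2], [2, 2, 3, 3], [3, 3, 5, 5]]
print(round(odd_even_reliability(toy), 3))
print(round(spearman_brown(0.88), 3))
```

By this formula a half-test correlation of .88 would step up to about .936. The half-test correlation actually obtained is not reported here, so that pairing is arithmetic illustration only.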

Permanency of engineer interest scores with 


Table 1 


Permanence of Engineer Interest Scores, Test-Retest Correlations 


Group 

High school juniors 
College freshmen 
College freshmen 
College freshmen 
College freshmen 
College sophomores 
College sophomores 
College seniors 


Interval in Years 


9 


College seniors 83 
College freshmen 10 years later 139 87 


* Burnham (1) reports .78 for 188 college freshmen; Glass (2) reports .71 for engineering students who con- 
tinued in college and .66 for 85 students who dropped out of college; and Van Dusen (7) reports .85 for 76 college 
freshmen. The .75 given in the table is a weighted average of these. 


our college freshmen is .91 for one year, .77 for nine years, and .76 for nineteen years. For the ten year interval between 1939 and 1949 the correlation is .87. See Table 1. Permanence of interest scores from 1 to 22 years has recently been published (5) in which two profiles of 34 interest scores were correlated. These coefficients average .06 lower than those given in Table 1. The two different methods of calculating permanency of interest scores may be responsible for the differences in coefficients. It is more likely that the difference is to be explained on the basis that the engineer interest scale is one of the most reliable of our scales, averaging .06 higher than the average reliability of .877 for 36 scales.

Reliability, or constancy, of engineer interest scores is shown in general in Table 1. Reliability is not equal, however, over the entire range of scores ranging from 60 to -5. The standard deviations of differences in scores between test and retest have been calculated for each of six intervals, i.e., 1930-31, 1930-39, 1930-49, 1931-39, 1931-49, and 1939-49. Averages of the six sets of data are given in Table 2. The A ratings (scores of 45 to 68) have the lowest standard deviation, amounting to only 84 per cent of the average and the low scores of -5 to 14 have the greatest deviation, amounting to 117 per cent of the average. It is fortunate that the high scores, upon which interpretation very largely rests, are the most reliable of all scores.

Constancy of interest scores over long periods of time is remarkable. But this is not so true for a small minority of 5 to 10 per cent of cases. Records of several hundred men tested before and after their war experiences showed very little change in their profiles of 34 occupational interest scores. But there were real changes among a few.

Among these freshmen there were 101 cases among 1115 in which there was a shift of 15 or more in engineer score. In terms of the distribution of scores about 36 such cases might be expected but not as many as 101 cases. Among these exceptional cases there were some who shifted 15 or more in one direction between 1931 and 1939 and shifted back again between 1939 and 1949. Such men are counted twice. Elimination of these 9.2 per cent of the total reduces the average standard deviation from 8.8 to 7.0. Even with the elimination of these extreme cases there is clear evidence that the high scores are more reliable and the very low scores are less reliable than the average, although now the differences are not so noticeable. The last column of Table 2 makes clear that part of the greater reliability of high scores is the relative absence of cases with large shifts in score.


Table 2

Standard Deviations of Differences between Test and Retest Scores

                                      Differences of 15 and More     Per Cent of
                 All Data             Omitted (9.2% of Cases)        Scores with
                                                                     Differences of
Scores      S.D.   Ratio of S.D.      S.D.   Ratio of S.D.           15 and More
                   to Average                to Average
45 to 68     7.4        84             6.4        91                      4.4
30 to 44     8.8       100                       104                      8.4
15 to 29     8.7        99                        97                     10.2
-5 to 14               117                       108                     15.4

Total                  100                       100                      9.2


There is no noticeable regression of scores in the interval of 1930-31 and very slight regression in the interval of 1939-49, but there is appreciable regression in the four other intervals between 1930-39, 1930-49, 1931-39, and 1931-49. Data from all six test-retest intervals have been combined, however, and the data given in Table 3. High engineer scores of 60 to 68 regress 5.4 downward and the low scores of -5 to 9 regress 8.4 upward.
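
The figure of about 36 expected shifts of 15 or more, quoted in the discussion of large shifts above, can be approximated with a short calculation. This is a sketch under an assumption the text does not state: that test-retest differences are normally distributed about zero with a standard deviation of 7.0, the average remaining after the large shifts are removed. The function name is hypothetical.

```python
from math import erfc, sqrt


def expected_large_shifts(n_cases, sd, threshold):
    """Expected count of differences with absolute size >= threshold,
    assuming the differences follow Normal(0, sd)."""
    z = threshold / sd
    # erfc(z / sqrt(2)) equals the two-sided tail probability P(|Z| >= z).
    return n_cases * erfc(z / sqrt(2))


N_CASES = 1115   # test-retest comparisons reported in the text
OBSERVED = 101   # shifts of 15 or more reported in the text

expected = expected_large_shifts(N_CASES, sd=7.0, threshold=15)
print(f"expected {expected:.1f}, observed {OBSERVED}")
```

Under these assumptions the expected count comes to roughly 36, against the 101 cases observed, which is the contrast the text draws.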


On the engineer scale the point of no regression is about 38. No explanation occurs to us as to why scores regress upward and downward from 38 as 38 is above the average score of 31 of non-engineers and much above the chance score of 23.


Table 3

Regression of Engineer Interest Scores

Score          N      Regression
60 to 68                 -5.4
50 to 59                 -2.9
40 to 49
30 to 39
20 to 29
10 to 19
-5 to  9                 +8.4

Total        1115


There may be high permanency of scores as measured by correlation or standard deviation of differences in scores and at the same time there may be increase or decrease in mean scores. In this case mean scores have changed very little as is shown in Table 4. College freshmen scores did not change when retested as sophomores but did increase by about 3 points in 1939 and 1949.


Table 4

Mean Engineer Interest Scores of College Freshmen

Year Tested    Mean*
1930           31.3
1931           31.0
1939           34.1
1949           34.2

* Differences in mean scores from 1930 to 1939 and 1949 are significant at the 5 per cent level and differences from 1931 to 1939 and 1949 are significant at the 1 per cent level.


Distribution of Engineer Interest Scores 


Table 5 gives the distribution of engineer interest scores of the criterion group of 513 adult engineers and of the 306 freshmen. The two groups overlap 46 per cent. The freshman group is heterogeneous. When sub-groups are isolated, we find that freshmen who later become engineers overlap 99 per cent with the criterion group; those who become chemists, physicists and geologists overlap 91 per cent; physicians overlap 48 per cent; while lawyers overlap only 16 per cent. The interests of chemists, physicists, etc. correlate about .85 with the interests of engineers, the interests of physicians correlate .52 and the interests of lawyers -.44. As the correlations between engineer and other groups decrease from 1.00 toward -1.00, the mean engineer interest scores decrease and the per cent of overlapping between the two decreases.


Table 5

Distribution of Engineer Interest Scores of Adult Engineers; College Freshmen and Seniors; and 4 Sub-Groups of Freshmen who became Engineers, Chemists, etc., Physicians, and Lawyers in 1949

Columns: Engineers (N = 513); Freshmen (N = 306); Seniors (N = 285); Occupation in 1949 of Freshmen: Engineers (N = 24), Chemists, etc. (N = 13), Physicians (N = 31), Lawyers (N = 21). Closing rows: Mean; σ; Per cent overlapping with 513 engineers.

Engineers (N = 513): Mean 50.0, σ 10.0; per cent of cases in successive score intervals: 1, 3, 13, 17, 18, 16, 16, 8, 5, 2, 1.


Freshman Engineering Interest and 
College Major 


Table 6 gives the distribution of engineer interest scores according to college major. The majors are arranged in order according to the mean engineer interest scores of the students so enrolled. Originally two tables were prepared, one concerned with undergraduate majors and one with graduate majors. As the results were very similar, the two tables have been combined. Men whose undergraduate and graduate records are both known are counted twice in the table. The records are fairly complete based upon the reports rendered in 1939 and 1949. No discrepancies were noted between the two records except that some men had not finished their academic work by 1939. A few students completed only one or two years of college work. They are included in Table 6 if their major could be determined from either the courses taken or their statements; otherwise they are excluded.


Table 6

Distribution of Engineer Interest Scores of Freshmen According to College Major: Major of Undergraduates and Graduates Combined (Men Who Did Both Are Counted Twice)

Columns: College Major; Engineer Interest Scores (intervals including 10-19, 20-29, 30-39, 40-49, 50-59); Engineer Average Score.

College majors, in the order tabulated: Engineering; Chemistry; Physics; Geology; Medicine; Biology; Education; Mathematics; Accounting; Business; English; Economics; Psychology; History; Art, Music, Drama; Political Science; Social Science; Philosophy; Law; Foreign Language; Misc.

Table 6 and others that follow contain halves, 
as, for example, 8.5 freshmen majoring in 
engineering with an engineer interest score of 
30-39. Some students gave majors in two 
fields, as economics-accounting, or they changed 
their major, for example, from engineering to 
physics. In such cases each major is tabulated
as “½.” The data actually record choices
not men.

There were 101 freshmen who scored 40 and 
higher in engineer interest score. See Table 6. 
Of these 101 freshmen, 41.5 majored in engi- 
neering, 19 in chemistry, physics and geology, 
20 in medicine and biology, and 20.5 in some 
other field. The proportion majoring in engineering
is higher if only scores of 50 and above
are considered, i.e., 51 per cent in this case in 
contrast to 41 per cent when scores of 40 and 
higher are considered. In other words, if a 
freshman rates A in engineer interest there are 
about 82 chances in 100 he will major in a 
physical or biological science. 
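
The percentages in the preceding paragraph follow from the half-counts described earlier. The sketch below simply re-does that arithmetic for the 101 freshmen scoring 40 and higher; the dictionary keys and variable names are illustrative only.

```python
# Choices of major among the 101 freshmen scoring 40 and higher (from the text).
# Halves arise because a man with two majors is tabulated as one-half in each.
major_counts = {
    "engineering": 41.5,
    "chemistry, physics, geology": 19.0,
    "medicine, biology": 20.0,
    "other fields": 20.5,
}

total = sum(major_counts.values())
shares = {major: 100 * n / total for major, n in major_counts.items()}

# Everything except "other fields" is a physical or biological science.
science_share = 100 * (total - major_counts["other fields"]) / total

for major, pct in shares.items():
    print(f"{major}: {pct:.1f} per cent")
print(f"physical or biological science: {science_share:.1f} per cent")
```

The 82 chances in 100 quoted in the text appear to refer to A ratings (scores of 45 and above), a different cut from the 40-and-higher group tabulated here, so the two figures need not agree exactly.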

There is a close relationship between engineer 
interest scores of freshmen and the subject 
matter of their academic work. See Table 6. 
As the engineer score decreases the students’ 
majors progressively shift from: physical 
sciences to biological sciences, to accounting 
and business, to social sciences, to law, and 
to foreign languages. 

Because there are very few students enrolled 
in most of the majors not too great reliance 
can be placed upon the data for these separate 
majors. If there were more cases we should 
confidently expect the major of mathematics 
to fall between geology and medicine and for 
education to appear somewhat lower in Table 6. 

Scores on the engineer scale and a few other 
interest scales, such as lawyer and accountant, 
ought to provide a good indication of what 
major the student would find appropriate 
for him. 


Engineering Interest and Freshman 
Occupational Choice 


When a freshman scores high in engineer 
interest does he choose engineering for his 
future occupation or most any other occupa- 
tion? 

Table 7 lists all the occupations chosen by 
three or more freshmen among a total of 270
freshmen. Engineer scores are also given for 
28 freshmen who didn’t know what occupation 
they would enter. There were in addition 8 
freshmen who gave a college major instead of 
an occupational choice whose records are not 
included in Table 7. 

A total of 48 freshmen chose engineering. 
The engineer interest scores of these men are
given in the first row of data in Table 7, with
a mean score of 45.0. Similarly the engineer 
scores of the 46 freshmen who chose law are 
given with the mean score of 21.0. The last 
row of the table gives the average correlation 
between engineering and the occupation chosen 
by freshmen. Thus in the column headed 45 
there are given the number of freshmen choosing the listed occupations, 7 choosing engineering; 2 choosing chemistry; 2, geology; etc., down to 2 choosing a specific business activity, namely “wholesale grocer” and “auto business.”

Table 7 clearly indicates that as mean engineer interest score (last column of the table) decreases the occupation chosen by the freshmen differs more and more from engineering. In order to summarize the data it is necessary to express in some statistical form the relationships between choice of engineering and choice of other occupations. Until we have some way of expressing such relationships we can not conclude, except on the basis of judgment, whether a shift from engineering to medicine is greater or less than a shift to law. To measure such relationships
we have used correlation coefficients between 
the interests of engineers and the interests of 
men in other occupations. Thus, the amount 
of change from engineering to chemistry, or 
medicine, or law is expressed by the correspond- 
ing coefficients of .88, .52, and —.44 (3, 4). 


Table 7

Engineer Interest Scores Relative to Freshman Occupational Choice

Columns: Occupational Choice in 1930; Correlation with Engineer; Engineer Interest Scores in 1930; Average.

Rows: Engineer (correlation 1.00); Chemist; Geology; Physics; Physician; Miscellaneous; Journalism; “Business”; Foreign Service; Specific Business; Education; Lawyer; Investment; Banker; Total; “Don’t Know”; Average Correlation.


Unfortunately there are no such coefficients for many occupations which our students have chosen. On the basis of known correlations estimates have been used for the remainder.
The writer is confident that the majority of 
estimates are sufficiently accurate for this 
purpose. Although we possess a rather un- 
usual amount of information about these 
freshmen there are a number of cases where all 
we have is the title of the occupations and 
there is no way of knowing what functions
are performed. For example, does “moving 
pictures business” mean technical work and 
if so, is it engineering or artistic in nature, or 
does it mean operating a moving picture 
theater as a small business man? Our answers 
to such questions have been influenced by all 
the positions the man has held. The most 
difficult estimate to make concerned the term 
“business.” This term covers a wide range of 
activities, some of which do not correlate 
positively with one another, as for example, 
the correlation of —.63 between advertising 
man and production manager. Here we 
considered all the known correlations between 
engineer and business activities and arrived 
at the correlation of —.19 to represent the 
term “business.” As most business activities 
aside from production correlate negatively 
up to —.78 for selling life insurance it is 
possible that the coefficient of —.19 should 
be as low as —.40. An average of our cor- 
relations for specific business activities gives 
a coefficient of —.25. If —.25 had been used 
instead of —.19 the average of all cases would 
have been decreased from .20 to .19, an incon- 
sequential amount. Various calculations of 
this sort lead the writer to believe the esti- 
mates are sufficiently accurate for the general 
purpose of this article. 

Using such coefficients to express amount of 
change from engineering to other occupations, 
Table 7 indicates that as such correlations 
decrease from 1.00 to —.51 the mean engineer 
scores of freshmen choosing the occupations 
decrease from 45.0 to 20.8. See first and last 
columns of Table 7. The relationship may be 
expressed in another way, namely, as the 
engineer scores of freshmen decrease from 65 
to -5 the average correlations between
engineering and occupational choices decrease 
from .97 to —.40. See top and bottom rows 
of Table 7. The correlation between the two measures, i.e., engineer score and correlation between engineering and occupational choice, is .66.


Table 8

Mean Engineer Interest Scores of Freshmen According to the Degree Their Occupational Choices Agreed with Engineering

Occupational Choices
Distributed According         Engineer Interest Mean Scores
to Correlation with           Freshman       Sophomore
Engineering                   Choice         Choice
1.00                          45.0           48.8
.71 to .99                    44.3           45.5
.41 to .70                    33.9           35.6
.00 to .40                    34.4           35.5
-.01 to -.40                  18.7           25.8
-.41 to -.70                  22.0           23.1
-.71 to -.99                                 21.4

Average                       31.2           32.0

A table similar to that of Table 7 was 
prepared showing the relationship between 
engineer interest score in 1930 and sophomore 
choice of occupations in 1931. The results 
were similar to Table 7 and so are not repro- 
duced. This is not surprising since the 
correlation between freshman and sophomore 
choice is .80 (6). Table 8 summarizes Table 7 
and the unpublished table, showing the close 
agreement at the sophomore level between 
engineer interest score and occupational choice 
among those students selecting engineering, 
chemistry and closely related occupations. 
The rises in scores of 45.0 to 48.8 and 44.3 to 
45.5 are caused by a few men with low engineer 
scores changing their choices from these 
occupations to non-engineering occupations. 


Freshman Engineer Interest and Occupations 
19 Years Later 


Occupations engaged in in 1949 have been 
assigned to seven groups according to the 
correlations between the interests of engineers 
and the interests of men in other occupations. 
See column 1 of Table 9. Detailed distributions
of engineer scores are supplied in Table 5 of 
the 20 freshmen who became engineers, the 13 
who became chemists, geologists and physicists, 
the 31 who became physicians, and the 21 
who became lawyers. Table 9 again leads to the conclusion that as engineer mean scores increase from 22.5 to 50.4 the occupation engaged in 19 years later approximates more and more closely that of engineering, that is, the correlation with engineering increases from -.71 to 1.00. And the same conclusion results whether interest scores obtained in 1930, 1931, 1939, or 1949 are employed. The last column of the table expresses the above in terms of total overlapping. For example, the engineer interest scores of chemists, geologists and physicists overlap 93 per cent with the scores of engineers whereas the scores of salesmen overlap only 10 per cent.

The correlation between engineer score in 1930 and extent to which the occupation-engaged-in in 1949 deviates from engineering is .55. This coefficient is not as high as the .66 between 1930 score and 1930 occupational choice. But it is amazingly high for a correlation between freshman score in a single occupational interest, and occupation 19 years later.

Many factors contribute to choice of one’s occupational career. For the total population intelligence is a very important factor. But for our specific group of freshmen intelligence must play a much less important role. Minimum requirements of academic ability in high school and intelligence test for admission to Stanford University are high enough so that most students can enter most occupations in so far as general ability is concerned. Lack of finance is a factor largely unrelated to intelligence or interest. An example of how lack of money interfered is that of a freshman who planned to be a physician. He was unable to finance attendance at a medical school. Some years later he obtained one year’s work in a dental school. Since then he has been a dental technician. He is not a physician as he planned to be while in college. But he has not deviated so very far from his original plan, as far as interests go, since the interests of dental technicians must correlate above .50 with those of physicians. Family pressure has been another factor that prevented men from engaging in work in harmony with their interests. A few of the worst misfits are men who early expected to enter their family controlled business or profession and today have well-paid positions therein, but clearly indicate lack of interest in their work.


Exceptions to the General Trend

The trend is unmistakable that as engineer 
interest scores increase from 0 to 68 students 
choose occupations while in college and enter 
occupations 19 years later more and more 
closely related to engineering. 

There are, however, individual exceptions to 
the general trend. A total of 55 freshmen 
among 306 obtained an A rating (score of 45 
to 68) on engineer interest in 1930. See Table 
7. No information is available concerning the 
subsequent occupation of ten of these men. 
Among the 45 for whom information is 
available, 18 became engineers and 27 did not. 
What explanation can be given for the fact 
that only 40 per cent with A ratings became 
engineers? 


Table 9

Relationship of 1930 Engineer Score to Correlation Between Engineering and 1949 Occupation

Columns: Correlation Between Engineer and 1949 Occupation; Examples of 1949 Occupation; Mean Engineer Interest Score in 1930, 1931, 1939, and 1949; Per Cent Total Overlapping with Engineers in 1930.

Correlation Between                                                    Mean Engineer
Engineer and 1949                                                      Interest Score
Occupation            Examples of 1949 Occupation                      in 1930
1.00                  Engineer                                         50.4
.71 to .88            Geologist, Chemist, Industrial Engineer          47.9
.40 to .70            Physician, Farmer, Production Manager
.00 to .39            Purchasing Agent, Statistician
-.01 to -.39          Office Work, Accountant, Writer, Personnel
-.40 to -.70          Lawyer, Sales Manager, Retail, Wholesale
-.70 to -.78          Salesmen                                         22.5

Total, Non-Engineers**

* Only 9 cases in 1930, 7 in 1931, 8 in 1939 and 1949.
** Total is 218 in 1930, 182 in 1931, 148 in 1939 and 1949.





Table 10

Occupations Actually Engaged in in 1949 by 45 Freshmen with A Rating in Engineer Interest Score and Occupations They Should Have Entered on Basis of Their Highest Occupational Interest Score

Correlation      Occupation Assigned on
with Engineer    Basis of Highest Score       N
1.00             Engineer                     20.8
.88              Chemist                       7.3
.63              Farmer
.52              Physician
.06              Production Mgr.
-.17             C. P. A.
                 Dentist
.50              Architect
.41              Printer
                 President, Mfg. Co.
                 Advertiser

Occupation Actually Engaged in in 1949, parenthetical entries: (2 Geologists, 1 Physicist); (1 Industrial Engineer); (1 Dental Technician); (1 Owner of Moving Picture Business); (1 Writer); (6 miscellaneous, see text).


The average man scored on 34 occupational scales obtains about three A ratings. In other words, the average man has the interests peculiar to men successfully engaged in three different occupations. Consequently the chances are that among 45 men with A ratings in engineer interest only 15 of them would engage in engineering work.

Actually these 45 men secured 217 A ratings, an average of 4.8 ratings each. Two reasons may be advanced for the unusual number of A ratings. First, the sample was selected on the basis of high scores in engineer interest. Men with high scores on any scale are likely to have more high scores all told than men not so selected. Second, engineer scores correlate .40 and higher with 14 other occupational scores in contrast to the average scale which correlates to this extent with only 8 other scales. It is therefore to be expected that men with high engineer scores will average more A ratings than the average man.

If the 45 men were distributed on a proportionate basis among the 217 A ratings we would have 9.3 entering engineering, 7.0 entering chemistry, 5.2 entering farming, 3.7 entering production management, 3.3 entering medicine, and 16.5 entering twenty other occupations. On this basis we have 9.3 entering engineering in contrast to 18 who actually did so.
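
The expectations for the 45 men with A ratings in engineer interest rest on simple proportions. The sketch below reproduces them. It reads the "proportionate basis" as each man entering one occupation drawn in proportion to the A ratings held by the group, which is an interpretation rather than a procedure spelled out in the text. Variable names are illustrative.

```python
n_men = 45               # freshmen with an A rating in engineer interest
typical_a_ratings = 3    # A ratings obtained by the average man
total_a_ratings = 217    # A ratings actually secured by the 45 men

# If each man had three A ratings and entered one of them at random,
# one-third of the men would be expected in engineering.
expected_if_typical = n_men / typical_a_ratings

# Average number of A ratings per man in this selected group.
ratings_per_man = total_a_ratings / n_men

# Every man holds an engineer A rating, so 45 of the 217 ratings are for
# engineering; distributing the 45 men in that proportion gives this figure.
expected_proportionate = n_men * n_men / total_a_ratings

print(expected_if_typical)
print(round(ratings_per_man, 1))
print(round(expected_proportionate, 1))
```

Both expectations fall well short of the 18 men who actually became engineers, which is the comparison the text goes on to refine by using each man's highest score.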


The above calculations assume that all A 
ratings are equally significant. The writer 
has frequently maintained that this is true, 
that an occupational interest with a score of 
45 and another occupational interest with a 
score of 65 should both be carefully considered. 
There is a great deal of truth in this statement. 
Nevertheless, the data we have handled in our 
twenty year follow-up of Stanford students 
make clear that the higher the score the 
greater the likelihood the man will actually 
enter the occupation. 

If now we consider, not the 217 A ratings, 
but only the highest score each man received, 
what occupations will be assigned to the 45 
freshmen on that basis? The left hand half of 
Table 10 gives the answer. On this basis 
20.8 would be engineers in contrast to 18 who 
did become engineers. The right hand half 
of the table gives the actual distribution. 
Thirty-two men can be definitely called 
engineers, chemists, farmers, and physicians 
matching 35.1 men in the left hand distribution. 
In addition seven men are listed within 
parentheses as approximating fairly closely 
occupations on the left hand side. This gives 
good agreement between theoretical expecta- 
tion and actuality on the part of 39 among 
the 45 cases. 

The six remaining cases are: (1) a director
of organizational planning who might be
likened to a production manager although
another with the same title is neither an
engineer nor production manager by education
or function (college major, engineering); (2)
a clerk in a shipping firm in 1939 who chose
business in 1930 and shipping in 1931 (major,
economics); (3) a mail carrier who had 2½ years
of engineering and has held a miscellaneous
assortment of jobs since then; (4) a partner in
a small retail business (major, engineering);
(5) a vice-president of a wholesale grocery
owned by his family who chose this both
freshman and sophomore years (major, history);
and (6) a salesman who has been in
business activities since graduation (major,
engineering).

The actual occupational careers of the 45 
men and the expected careers on the above 
basis may be summarized by using the cor- 
relations between engineering and each of the 
occupations. Actually the men entered occu- 
pations correlating on the average .64 with 
engineering but if they had entered the occupa- 
tion on which they had their highest score the 
average correlation would have been .77. 

The reader may decide for himself whether or 
not the 45 men have entered occupations in 
reasonable agreement with their freshman 
interest scores. Considering the fallibility of 
tests today and all the factors that determine 
occupational choice which are independent, 
or largely so, of interests, such as health, 
ability, finance, and family pressure, the over- 
all agreement between engineer interest scores 
and choice of occupation is far greater than 
the writer would have anticipated. 


Summary 


Stanford University freshmen took the
Vocational Interest Test in 1930
and the majority also took the test in 1931,
1939, and 1949. A fairly complete record of
their education and occupational career was
supplied on each of the four occasions.

This summary is restricted to scores on the 
engineer interest scale and the relationship 
of these scores to selection of college majors, 
occupational choices when freshmen, and occu- 
pations engaged in in 1949, nineteen years later. 

The reliability of the engineer interest scale 
is .936. Permanency of scores is .91 for one 
year, .77 for nine years and .76 for nineteen 
years. 


Freshmen who became engineers had scores 
when freshmen that overlapped 99 per cent 
with the engineer criterion group, whereas 
those freshmen who became physicians had 
engineer scores that overlapped 48 per cent 
with the criterion group. Similarly, for 
lawyers, the overlapping was 16 per cent. 

The relationships of occupations to engineer- 
ing are expressed by the correlations between 
the interests of men in the various occupations 
and the interests of engineers. 

As the mean engineer interest scores of 
freshmen increase from 0 to 68 there result: 
(1) a progressive shift in college majors from 
languages to law, social sciences, business, 
biological sciences, physical sciences and engi- 
neering; (2) a progressive shift in freshman 
occupation choice from occupations correlating
on the average −.44 with engineering to
occupations correlating .97 with engineering
and a correlation between the two measures of
.66; and (3) a progressive shift in occupations
engaged in nineteen years later from occupations
which correlate −.71 with engineering to
engineering itself, and a correlation between
the two measures of [value illegible].

Whatever an interest test measures, whether 
interests, preferences, values, goals, or what 
have you, it measures something very stable 
and permanently possessed and something 
that contributes very greatly to occupational 
choice. 

Received November 16, 1951. 

Early publication. 


References 

1. Burnham, P. S. Stability of interests. Sch. & Soc.,
1942, 55, 333.

2. Glass, S. S. An investigational analysis of certain
general and specific interests of engineering students.
Ph.D. Thesis, Purdue University Library,
1934.

3. Strong, E. K., Jr. Vocational interests of men and
women. Stanford, California: Stanford University
Press, 1943.

4. Strong, E. K., Jr. Manual for Vocational Interest
Blank for Men. Stanford, California: Stanford
University Press, 1951.

5. Strong, E. K., Jr. Permanence of interest scores
over 22 years. J. appl. Psychol., 1951, 35, 89-91.

6. Strong, E. K., Jr. Amount of change in occupational
choice of college freshmen. In press.

7. Van Dusen, A. C. Permanence of vocational interests.
J. educ. Psychol., 1940, 31, 401-24.





Academic Achievement and Strong Occupational Level Scores * 


John W. Gustad 
Vanderbilt University 


Following the recognition of non-intellectual
factors in academic achievement, attempts have
increasingly been made in recent years to
find adequate measures of motivation by
means of which to improve the efficiency of
the prediction of scholastic success. Because
of the method of construction involved, the
Occupational Level score (henceforth to be
referred to as OL) of the Strong Vocational
Interest Blank for Men has been suggested as
a promising approach. Strong (7) developed
this scale by contrasting the interests of high
status professional and business men with
those of laboring men. Darley (2, p. 60) has
described this variable as follows: “. . . a quantitative
statement of the eventual adult ‘level of
aspiration,’ represents the degree to which the
individual’s total background has prepared
him to seek the prestige and discharge the
social responsibilities growing out of high
income, professional status, recognition or
leadership in the community.” Further,
Darley (2, p. 66) suggests that, “. . . an
excessively low occupational level score seems
at present to be associated with lack of
‘staying power’ or ‘survival power’ in college
competition.”

Several research attempts have been made 
to assess the usefulness of this key. Strong 
(7, p. 201) studied a group of men in the 
Graduate School of Business Administration 
at Stanford, dividing his subjects into four 
sub-groups in terms of grades for one year. 
He then compared the mean OL scores for 
the upper and lower groups and found an 
insignificant difference of one point. He also 
computed the correlation between grades 
and OL scores, obtaining a coefficient of .114. 
Considering the restricted range of both grades
and OL scores among graduate students,
however, this is not unusual.

* The present study was conducted with the aid of
funds provided by the Carnegie Foundation for the
Improvement of Teaching, aid for which the writer
hereby wishes to express his gratitude.


Berdie (1), studying engineering freshmen
with regard to both college satisfaction and
scholastic achievement, found that scores on
the Engineer’s key of the Strong correlated 
only .10 with satisfaction and .13 with grades. 
OL scores correlated .01 with satisfaction and 
.03 with grades. He further found no relation- 
ship between intensity of engineering interests 
and grades. 

Kendall (3), having sorted his men subjects 
into three groups in terms of OL scores, made 
an analysis of co-variance, testing the signif- 
icance of the differences between mean grade 
point averages among the groups with ac- 
ademic aptitude, measured by the Ohio State 
Psychological Examination, held constant. 
He found the variance ratio to be significant
at a probability level between .05 and .01.
He concluded that, “If used with caution, 
OL scores at the extremes of the distribution 
should be helpful to the counselor in making 
judgments concerning individual cases for 
scholastic success.” 

Ostrom (6, 5) has reported two studies of 
the same sort, one with twelfth grade boys, 
the other with college freshman men. In the 
first, he devised three measures of drive in 
addition to OL: an interview, a teacher rating, 
and a “Guess Who” questionnaire. These 
three all were significantly related to OL, 
giving further indication of its nature. Making 
an analysis of co-variance similar to that of 
Kendall, he found no significant differences in 
mean achievement, with academic aptitude 
held constant, between groups differing in 
OL scores. In his second study, with college 
freshmen, he set up six groups in terms of
both OL and scholastic aptitude scores.
Making an analysis of variance of the three- 
by-two table, he found academic success to 
be related to both aptitude and OL. 


Purpose 


The present study was undertaken for two
reasons: first, to see whether, at the senior
college level where occupational choices are
most clearly set in terms of major courses,
OL predicts differential success; second, to
allow for the effect of appropriate or inappropriate
vocational choice, judged in terms of
profiles on the Strong. This latter point, it
seemed, was particularly important. The
question was whether it was realistic to
expect a student to channel his basic motivation,
as measured by OL, into studies if these
were not related in turn to his occupational
interests.


Method 


At the beginning of the winter quarter,
1950, Strong Vocational Interest Blanks were
filled out by the junior men. Juniors were
chosen because they would be making vocational
choices implemented by the selection
of a major, and because, by the junior year,
there was greater likelihood that the vocational
interest patterns would have matured. Scores
on the ACE Psychological Examination and
quality point ratios (QPR) were obtained
from the files of the University Counseling
Service and the Registrar. These were available
for 134 men. All were students in the
College of Arts and Sciences.

The interest profiles were first examined to 
determine whether they were appropriate in 
terms of the students’ major choices. The 
system used was that outlined by Darley (2). 
In making the judgment of appropriateness, 
either a primary in the proper interest area or, 
if the student had no primary, a secondary 
pattern in the area was accepted as indicating 
an appropriate choice. Seventy-four per cent 
of the group had appropriate choices, 26 per 
cent inappropriate. Next, the appropriate 
choice cases were further subdivided according 
to major field. There were three groups: (1) 
those majoring in Business Administration; 


Table 1

Analyses of Variance, with Co-Variance Adjustments for Academic Aptitude, of Scholastic Achievement
for the Several Academic Groups Separated According to OL Scores

Group           Source    Sum of Sq.: QPR (X)   Sum of Products   Adjusted Sum of X²
Bus. Admin.     Between          .584               54.95               .017
                Within         15.130              174.94             13.479
                Total          15.714              229.89             13.498
Pre-Medical     Between          .064                 .46               .177
                Within          8.160              132.64              6.004
                Total           8.224              133.10              6.181
Miscellaneous   Between          .280                6.47               .187
                Within          7.738              153.17              6.229
                Total           8.018              159.64              6.416
Appropriate     Between          .240                 …                  …
                Within         33.600                 …                  …
                Total          33.840                 …                  …
Inappropriate   Between          .224                 …                  …
                Within         10.142                 …                  …
                Total          10.366                 …                  …
Total           Between        10.165                 …               10.006
                Within         34.501                 …               29.122
                Total          44.666              604.96             39.128

[Entries marked … and the remaining columns (d.f.; Sum of Sq.: OL (Y); Mean Square; F; Decision) cannot be reliably recovered from this copy.]





(2) those following the pre-medical curriculum; 
and (3) a miscellaneous group. In all, there 
were then four groups: three appropriate, one 
inappropriate. 

Following this, a distribution of OL standard
scores was made for all 134 students and then
divided into thirds as nearly as possible.
Each of the four groups was then further
subdivided according to the following scheme:
(1) high (OL > 56); (2) middle (OL 52-56);
(3) low (OL ≤ 51).
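
The grouping scheme can be stated as a simple rule. The sketch below is an illustration, not the writer's procedure: it assumes the cut-offs were "above 56", "52 through 56", and "51 or below" (the inequality signs are not legible in the printed scheme), and the function name is hypothetical.

```python
def ol_group(ol_standard_score: int) -> str:
    """Assign an OL standard score to one of the three OL groups.

    Assumed cut-offs: high above 56, middle 52 through 56, low 51 or below.
    """
    if ol_standard_score > 56:
        return "high"
    if ol_standard_score >= 52:
        return "middle"
    return "low"


# Example: three hypothetical students.
print([ol_group(score) for score in (60, 54, 48)])  # ['high', 'middle', 'low']
```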

Finally, analyses of co-variance were con- 
ducted, following the method outlined by 
McNemar (4, Ch. 15), one for each of the 
four groups separately, one for all appropriate 
choice cases, and then for all cases combined. 
In this, the null hypothesis was that there were 
no differences in mean quality point ratios 
between the three OL groups when academic 
aptitude, measured by the ACE, was held 
constant. 
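
For readers unfamiliar with the computation, the following is a minimal sketch of a one-way analysis of co-variance with a single covariate, using the textbook adjustment in which the sum of squares of X is reduced by (sum of products)² divided by the sum of squares of Y. It is offered as an illustration under that assumption and may differ in detail from the routine in McNemar (4); the data and the helper names are invented for the example.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (x, y): x = criterion (e.g. QPR), y = covariate (e.g. aptitude)


def sums_of_squares(pairs: Sequence[Pair]) -> Tuple[float, float, float]:
    """Return (sum of sq. of x, sum of sq. of y, sum of products) about the means."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return ss_x, ss_y, sp


def ancova_f(groups: List[List[Pair]]) -> float:
    """F ratio for differences in x between groups with y held constant."""
    everyone = [pair for group in groups for pair in group]
    total_x, total_y, total_sp = sums_of_squares(everyone)

    within_x = within_y = within_sp = 0.0
    for group in groups:
        g_x, g_y, g_sp = sums_of_squares(group)
        within_x += g_x
        within_y += g_y
        within_sp += g_sp

    # Adjusted sums of squares of x (the part not predictable from y).
    adjusted_total = total_x - total_sp ** 2 / total_y
    adjusted_within = within_x - within_sp ** 2 / within_y
    adjusted_between = adjusted_total - adjusted_within

    df_between = len(groups) - 1
    df_within = len(everyone) - len(groups) - 1  # one d.f. lost to the covariate
    return (adjusted_between / df_between) / (adjusted_within / df_within)


def make_group(y_values: Sequence[float], shift: float = 0.0) -> List[Pair]:
    """Invented data: x equals y plus small noise, plus an optional group effect."""
    noise = [0.1, -0.1, -0.1, 0.1]
    return [(y + e + shift, y) for y, e in zip(y_values, noise)]


no_effect = [make_group([1, 2, 3, 4]), make_group([3, 4, 5, 6]), make_group([5, 6, 7, 8])]
with_effect = [make_group([1, 2, 3, 4]), make_group([3, 4, 5, 6]),
               make_group([5, 6, 7, 8], shift=1.0)]

f_no_effect = ancova_f(no_effect)      # group differences in x vanish once y is held constant
f_with_effect = ancova_f(with_effect)  # a genuine group effect survives the adjustment
```

In the first invented data set the groups differ in X only because they differ in Y, so the adjusted F is essentially zero; in the second, one group is shifted upward and the adjusted F is large.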


Results and Discussion 


The results of the foregoing analyses are 
summarized in Table 1. For all groups except 
the total, the F tests were not significant. 
Consequently, it was concluded that OL, at 
this level, was not a significant predictor of 
scholastic success within advanced major fields, 
even when the students were pursuing curricula 
appropriate to their measured interests. 

For the total group, however, including 
both the appropriate and inappropriate cases, 
the F test was significant. From this, it 
appeared that the earlier studies were con- 
firmed but only in a limited sense. For the 
total group, the F ratio for the unadjusted 
quality point ratios was also significant 
(F= 19.32; d.f.=2 and 131; P<.01). 
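
The unadjusted F ratio can be approximately reproduced from the sums of squares for QPR printed in Table 1 for the total group (between 10.165, within 34.501) and the degrees of freedom given above. This is only an arithmetic check; the variable names are illustrative.

```python
# Unadjusted analysis of variance of QPR for the total group.
between_ss, within_ss = 10.165, 34.501
df_between, df_within = 2, 131

mean_square_between = between_ss / df_between
mean_square_within = within_ss / df_within
f_ratio = mean_square_between / mean_square_within

print(round(f_ratio, 1))  # 19.3
```

The result is close to the reported 19.32; the small difference presumably reflects rounding in the printed sums of squares.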

While the results for the separate groups 
were not significant, it was felt that differences 
in OL scores among the groups might have 
contributed to the significant F test for the 
total group. To check this, an analysis of 
variance was made of the OL scores of the 
four groups. This is summarized in Table 2. 
The conclusion was to accept the null hypothesis 
that there were no differences between the 
four groups on OL. It would seem, therefore, 
that differences in OL scores had not accounted 
for the significant results obtained for the 
total group analysis. 


Table 2

Analysis of Variance Testing the Differences in Mean
OL Scores of the Academic Groups

Source     Sum of Squares   d.f.   Mean Squares    F      Decision
Between          …            2         …         1.31    Accept
Within       3,252.…        131       24.83
Total        3,318.15       133

[Entries marked … are not legible in this copy.]


It remains, then, to account for the obtained 
results and also to try to infer the guidance 
significance of these. As far as the results for 
the total group are concerned, they agree with 
earlier studies; it is in the major groups that 
the differences appear. The first and most 
likely explanation which suggests itself is 
restriction of range on all three variables: 
OL, grades, and ACE scores. College men 
all probably have relatively high OL scores; 
added to this is the effect of selective elimina- 
tion and survival which, by the junior year, has 
left only the more able students, both as to 
ability and achievement. Even among those
in curricula for which their interest patterns
are inappropriate, the differences in achievement
between OL groups were insignificant.
Yet, when this group was pooled with the
appropriate groups, the total F test was
significant. Reference to Table 1 will show
that, since the F for the total appropriate 
group was insignificant while that for the 
total group was significant, it would seem that 
the addition of the inappropriate cases resulted 
in the significant difference. In some way, 
difficult at present to explain, the inappro- 
priate cases appeared to differ from other 
students. 

However, a dean, considering applicants for 
senior college, could use OL scores with caution, 
especially if he took other factors into account, 
to predict success, since in most such groups 
there will be students with both appropriate 
and inappropriate choices. A counselor, on
the other hand, concerned with an individual
student and knowing whether the interest
pattern and curricular choice were in line,
would probably not find the OL score particularly
useful. The one exception might be the
case of the student whose OL score was very
low, approximating that of unskilled or semi-skilled
workers. Yet, in the present study, 42
cases (low OL group) had OL scores at or
below the point (standard score 51) which
Strong (7, pp. 196-197) has indicated as being
subprofessional.

There seem to be two research designs which 
might be useful in avoiding the restricted range 
problem. First, one might use as subjects a 
group of freshmen engineers, since they are 
following from the outset a curriculum pre- 
sumably near their interests. The study by 
Berdie (1), however, yielded insignificant 
results, although he did not hold academic 
aptitude constant. The other approach would 
make use of the longitudinal design so that a 
measure of the restriction of range might be 
obtained and used. 


Summary and Conclusions 


From a study of the relationship between
OL scores and college grades, with scholastic
ability held constant, among a group of junior
Arts college men, the following conclusions
seemed to be warranted:

1. Considering each of the groups separately 
(major groups, inappropriate group, total 
appropriate group), there were no differences 
in grades between groups separated in terms 
of OL scores. 

2. For the total group, appropriate and 
inappropriate cases pooled, there were signif- 
icant differences between scholastic achieve- 
ment means of the OL groups. 


3. No differences were found between the 
major groups in terms of OL scores. 

4. Restriction of range on all three variables 
was suggested as the most likely explanation 
of the negative findings. 

5. The conclusions of previous studies were 
partially supported, but the findings appear to 
have more potential value in selection than in 
guidance. 

6. Two research designs were suggested by 
means of which it might be possible to avoid 
the restricted range problem. 


Received May 7, 1951. 


References 


1. Berdie, R. F. The prediction of college achievement
and satisfaction. J. appl. Psychol., 1944, 28,
239-245.

2. Darley, J. G. Clinical aspects and interpretation of
the Strong Vocational Interest Blank. New York:
The Psychological Corporation, 1941.

3. Kendall, W. E. The occupational level key of the
Strong Vocational Interest Blank for Men. J.
appl. Psychol., 1947, 31, 283-287.

4. McNemar, Q. Psychological statistics. New York:
John Wiley and Sons, 1949.

5. Ostrom, S. R. The OL key of the Strong Vocational
Interest Blank for Men and scholastic success at
the college freshman level. J. appl. Psychol.,
1949, 33, 51-54.

6. Ostrom, S. R. The OL key of the Strong test and
drive at the twelfth grade level. J. appl. Psychol.,
1949, 33, 241-248.

7. Strong, E. K., Jr. Vocational interests of men and
women. Stanford: Stanford University Press,
1943.





Interest Item Response Arrangement As It Affects Discrimination 
Between Professional Groups * 


John V. Zuckerman 


Human Resources Research Office, The George Washington University 


An important aspect of interest measurement 
methodology is the question of how much 
different item arrangements contribute to 
discrimination between various groups. This 
problem is of particular significance when the 
groups to be distinguished are quite similar 
in their work and interests, as, for example, 
specialty groups within a single profession. 
It is reasonable to suppose that some particular 
item form might be more effective than another 
in “squeezing out” such small differences as
would be assumed to exist. 

This study concerns an effort to determine 
the relative merits of two types of interest 
item response arrangement in discriminating 
among the interests of professional groups.
The research was one phase of a project on 
the development of an interest instrument 
intended for medical specialists. 


Interest Inventory Item Arrangement 


There are two methods of arranging interest 
items in general use. L-I-D (like-indifferent- 
dislike) items permit the choice of a response 
among a graded series of attitudes toward a 
statement. Forced-choice items require the 
selection of one or more alternative statements 
over another or others. 

The Strong Vocational Interest Blank, one of 
the two best-known interest inventories, uses 
many L-I-D and similar items permitting a 
choice among responses (320 out of 400 items). 
The Kuder Preference Record—Vocational, 
consists of 168 triadic items, requiring forced 
choices of the best- and least-liked statements 
in each group of three. 

One comparison of the two item arrange- 
ments may be made with respect to the number 


* This article is based on research completed at Stan- 
ford University, while the author was a member of the 
Medical Specialists Research Project, Department of 
Psychology, and represents a portion of a thesis sub- 
mitted in partial fulfillment of the requirements for the 
Ph.D. degree at Stanford. The research was sponsored
by the Surgeon General, U. S. Army. 


of possible scoring weights available for any 
given statement. When two groups are 
contrasted in interests, a single L-I-D item 
can be given as many as three scoring weights, 
since any three of the six percentages in the 
response table which is produced can be 
changed independently of the others. For 
forced-choice items which use pairs of state- 
ments, but one weight can be secured for each 
two statements. 

Thus an advantage of the L-I-D item form 
is that it is theoretically possible to obtain 
more weights from a given number of state- 
ments in a given physical space than from 
a forced-choice arrangement using pairs of 
statements. Such a forced-choice arrange- 
ment would require a much longer inventory, 
taking more time to administer, if it were to 
equal the L-I-D form in number of weights. 

A possible advantage of the forced-choice 
item form has been brought out in a recent 
critique by Cronbach (1), who suggests that 
such item forms as L-I-D and Yes-No-? give 
rise to the possibility of responses not at all 
related to what the tests are designed to 
measure. These he terms “response sets.” 
Examples of response sets are answering “like”
to all items on an interest inventory regardless
of content, or using only the “like” and “indifferent”
categories of response because of a habit
not to dislike anything, or because of a special 
personal definition of disliking. Cronbach (2) 
further contends that such sets reduce test 
validity by introducing extraneous variance, 
and he states that the sets can be eliminated 
by the use of item forms requiring a choice 
among alternative responses, rather than the 
expression of attitudes toward a single state- 
ment. 

Evidence of a quantitative nature favors the 
L-I-D item form, since it has been shown to 
differentiate occupations and has been demon- 
strated to be reliable and valid for vocational 
guidance for some twenty years. However,
the theoretical points raised concerning the
possible additional discrimination provided by 
forced-choice forms provide sufficient reason 
for investigating the relative merits of the 
item arrangements, especially when discrimina- 
tion between similar groups is considered. 
Therefore the question was raised: In an 
interest inventory, which item arrangement 
provides more discrimination between profes- 
sional groups, forced-choice or L-I-D? 

Selection of Vocational Groups for Study 

Because the education profession contains
well-defined specialty groups of considerable
size, it was chosen for study. In addition,
since engineers are known to differ considerably
from teachers in their interests (Strong, 7), a
group of electrical engineers was selected to
contrast in interests with educators.

It was hypothesized that there would be
little difference between the two item forms
for making the “easy” discrimination between
the interests of educators and electrical
engineers, but that for the “difficult” discrimination
between sub-groups within education,
forced-choice item forms would have an
advantage because of a tendency to “squeeze
out” differences of small size.


Plan of the Study 


Two inventories, differing only in item 
arrangement, were administered to the same 
individuals, in several different professional 
groups, using a counter-balanced order of 
testing in order to control as many sources of 
error variance as possible. 

The responses of members of criterion
groups were subjected to item analysis, and 
occupational interest scales were developed. 
Then the scales were reapplied to the answer 
blanks of the criterion groups and comparisons 
of the discrimination between groups made 
from form to form. 


Procedure 


Interest Inventory Construction. An analysis
of the characteristics of certain professional
sub-groups provided the hypothesis that they
differed in interests relating to their differing
working functions. In a preliminary study of
medical specialty groups, differences in the
functions of internists, pathologists, psychiatrists
and surgeons were noted. Four such
modes of functioning were named and described
as follows:

Analytic: Preferences for problem-solving, theorizing, reasoning.
Visual: Preferences for using visual symbols, as in reading or map-reading.
Social: Liking for working with people.
Manipulative: Tool-using preferences; liking for sports or operating machinery.

The four modes were to be measured by 
interest items in inventories consisting of 
descriptions of occupational and avocational 
activities. 

An interest inventory was made up of 114 
paired-comparison forced-choice items based 
upon the functional scheme just mentioned. 
Data secured from a pretest on 117 college 
sophomores and 144 U. S. Army medical offi- 
cers were used to refine the modal scales, and 
provided a basis for the construction of a second, 
more refined interest instrument. 

Occupational descriptions from the DOT 
(8) and activity items from SVIB (7) were 
rewritten and others of a similar nature devised. 
A large number of these was submitted to 
six judges who classified the items with refer- 
ence to the modes of preference. 

The judges selected 65 occupational descriptions
and 54 activity items as unambiguous.
Of these, 60 occupational items and 52 activity 
descriptions were chosen by the author and 
another psychologist and grouped in fours. 
Each group of four contained statements 
representing each of the four modes of dealing 
with the environment. Thus there were 
obtained 15 occupational description clusters, 
and 13 activity clusters. Within each cluster 
the items were equated for social prestige, 
intelligence, education and skills required for 
the activities, which it seemed subjectively 
desirable to hold constant within the groups 
of four statements. 

Although care was taken to hold the state- 
ment groups equal for social prestige of 
occupations or activities, it was not considered 
serious if some errors were made, since a 
recent study by Fehrer and Strupp (3) has
shown that it makes little difference in
responses if interest items vary in this manner. 

The clusters of items were arranged in a 
random order (separately for occupations and 
activities) and then pairs of statements were 
drawn at random from each cluster and 
arranged on a test form as A-B forced-choice 
items. This procedure was continued until 
all possible pairs (six for each cluster) had 
been formed. 

The resulting forced-choice interest inven- 
tory contained 90 occupational items and 78 
activity items, 168 in all. 

An equivalent L-I-D inventory form was 
produced by shuffling the single statements 
(occupational and activity items were treated 
separately). The form contained 112 items, 
60 occupations and 52 activities. 
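
The item counts for the two forms follow directly from the cluster design, as the short check below shows. It is a sketch for verification only; the mode names are taken from the text and the variable names are illustrative.

```python
from itertools import combinations

MODES = ("Analytic", "Visual", "Social", "Manipulative")
OCCUPATIONAL_CLUSTERS = 15
ACTIVITY_CLUSTERS = 13

# Every cluster holds one statement per mode, so all possible pairs number C(4, 2).
pairs_per_cluster = len(list(combinations(MODES, 2)))

forced_choice_occupational = OCCUPATIONAL_CLUSTERS * pairs_per_cluster
forced_choice_activity = ACTIVITY_CLUSTERS * pairs_per_cluster
forced_choice_total = forced_choice_occupational + forced_choice_activity

# The L-I-D form presents each statement singly.
lid_total = (OCCUPATIONAL_CLUSTERS + ACTIVITY_CLUSTERS) * len(MODES)

print(pairs_per_cluster, forced_choice_occupational, forced_choice_activity,
      forced_choice_total, lid_total)  # 6 90 78 168 112
```

The same 112 statements thus yield 168 forced-choice items, which is why the forced-choice form is the longer of the two.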

The inventory was titled the Occupational 
and Activity Preference Blank from the nature 
of the items, the two forms being identified as 
Form FE (forced-choice) and Form OE (open- 
ended, or L-I-D). Instructions for self-ad- 
ministration using electrically scored answer 
sheets were prepared. 


Subjects. The educational profession is
divided into two relatively distinct sub-groups
with specific entrance requirements. Teachers
constitute the bulk of the profession (9), while 
supervisors and administrators, including 
principals, vice-principals and superintendents, 
make up the balance. Guidance workers, 
while negligible in percentage in the profession 
as a whole, are represented in sizable numbers 
in the training programs. Subjects in those 
specialties were selected for testing by visiting 
every class in education during a term at 
Stanford University which had more than 
100 students. Both men and women were 
tested, although only men were included in 
criterion groups. 

In addition to the educational specialists, 
a group of electrical engineering students in a 
graduate seminar was tested. 

Test Administration. Each group visited 
was given instructions concerning the purpose 
of the test, which was described as an evalua- 
tion of professional interests. All the subjects 
were asked to fill out a vocational information 
blank, data from which were used later to 
select criterion group members. 

Forms FE and OE of the OAPB were then
passed to the subjects. These were marked
so that half the individuals at random in each 
group received instructions to begin Form FE 
first and the balance were instructed to start 
with Form OE. No time limit was assigned 
for the completion of the blanks, but instruc- 
tions were given to work as rapidly as possible. 

One group was carefully timed. The timing 
for 36 individuals completing both inventories 
showed a median time of 24.5 minutes required 
for Form FE, while Form OE, physically about 
45 per cent the length of the other, required 
a median time of 12.6 minutes to complete. 

About 430 men and women students in 
education and 98 electrical engineering stu- 
dents were tested. Four hundred and eighteen 
completed pairs of blanks were secured from 
educators, and 94 pairs from electrical engi- 
neers. Not all the education students were 
included in criterion groups. The extra blanks 
were used in a reliability study of the interest 
scales which were developed. 


Treatment of the Data 


Composition of Criterion Groups. Three 
groups in education were defined: educa- 
tion students-in-general, administrators and 
teachers. All members were male, between 
21 and 55 years of age. Education students-in-general
included 50 per cent preparing for
careers in administration, 30 per cent for
teaching, and 20 per cent for guidance,
proportions representative of male
students at Stanford. The term “administrators”
was chosen to cover students preparing
for supervision and for administration.
Members of this group were required to have 
three or more years of experience in education 
and to be in the specialty group at the time 
tested. Teachers were required to meet the 
same criteria. Both teacher and administrator 
groups contained only those individuals who 
expressed a desire to remain in their specialty 
group. 

The electrical engineering student criterion 
group consisted of men ranging from 21 to 55 
years of age, all of whom were committed 
to careers in electrical engineering and approved 
for advanced training by their department 
head. 

Construction of Occupational Interest Scales. 
For each of the criterion groups, item analysis 
data were secured, and interest scales were 
prepared for both inventory forms. 

The scale system used was an adaptation of 
the method employed by Strong (7), in which 
interest data are weighted in terms of the 
differences between proportions of responses 
for two criterion groups. The datum from 
which differences are measured is termed by 
Strong the “point of reference” and the amount 
of differentiation between any two groups is 
in part a function of the point of reference 
chosen. 

For educational comparisons, the first point 
of reference used was interests of the education 
students-in-general group (N=150). Admin- 
istrators’ and teachers’ interests were each 
differentiated from these (scales were named 
“ADMINISTRATOR” and “TEACHER”). 
Because the latter groups were small 
(administrators, N=56; teachers, N=41), a comparison 
was made directly between the two groups, 
using administrator interests as a point of 
reference (this scale was labeled the 
“ADMINISTRATOR-TEACHER” scale). 

The point of reference for the comparison 
of educator interests with those of electrical 
engineers was the education student-in-general 
group. The engineer group was medium 
sized (N=94). 

Strong’s weighting table requires criterion 
groups of at least 100, so his system was not 
used directly. Strong has shown (7) that 
other methods yield about the same results as 
his own. One of these is a scheme developed 
by Guilford for securing item weights (4) 
which is usable both for forced-choice and 
L-I-D items. 

The Guilford method weights each item from 
zero to plus or minus four, in accordance with 
a formula which takes into account both the 
magnitude of differences and the amount of 
confidence one has that they represent true 
differences. 
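
The idea of a weight that reflects both the size of a difference in response proportions and the confidence that the difference is real can be sketched as follows. This is not Guilford's formula, which the text does not reproduce. It is a hypothetical stand-in that rounds a two-proportion z statistic and clips it to the range of plus or minus four. The function name `item_weight` and the example proportions are assumptions made only for illustration.

```python
import math

def item_weight(p_group, n_group, p_ref, n_ref, max_weight=4):
    """Hypothetical item weight (NOT Guilford's actual formula).

    Uses a pooled two-proportion z statistic, so the weight grows with
    the size of the difference and with the sample sizes behind it,
    then rounds and clips the result to the range -max_weight..+max_weight.
    """
    pooled = (p_group * n_group + p_ref * n_ref) / (n_group + n_ref)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_group + 1 / n_ref))
    if se == 0:
        return 0  # no variation in responses, so no basis for a weight
    z = (p_group - p_ref) / se
    return max(-max_weight, min(max_weight, round(z)))

# Illustrative call: 56 administrators against 150 students-in-general,
# with made-up response proportions for a single item.
example_weight = item_weight(0.9, 56, 0.1, 150)
```

A weight of zero results when the two groups respond alike, and the sign shows which group favors the response.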




By means of the Guilford system, scoring 
keys were produced for the four comparisons 
made for each form of the OAPB. Weights 
of more than unity were used only in the 
contrast of the interests of educators and 
electrical engineers, where some weights of 
two and three were employed. 

The answer blanks of the criterion groups 
were scored for each scale applicable to each 
group, and blanks for 171 men and women 
not in criterion groups were scored for all 
scales to provide reliability information. 


Results 


From the scores of the criterion group 
members, means, sigmas and standard errors 
of the means were computed. Differences 
between mean scores were evaluated for 
significance. Table 1, below, presents the 
mean differences, which can be seen to be 
highly significant for each comparison. 

Table 1 

Differences Between Mean Scores of Professional 
Groups on Four Interest Scales of Form FE 
and Form OE, OAPB 

Groups Contrasted                              Scale Name 
Form FE 
Educators-in-General, Electrical Engineers     Ed-Eng 
Administrators, Teachers                       Adm 
Administrators, Teachers                       Tea 
Administrators, Teachers                       Ad-Tea 
Form OE 
Educators-in-General, Electrical Engineers     Ed-Eng 
Administrators, Teachers                       Adm 
Administrators, Teachers                       Tea 
Administrators, Teachers                       Ad-Tea     11.0 

* All critical ratios are significant at or beyond the 
.001 level of confidence. 
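
The significance test described above can be sketched as a critical ratio, that is, the difference between two means divided by the standard error of that difference. The sketch assumes two independent groups, so the standard error of the difference is the root of the summed squared standard errors of the means. The function name and the input values are illustrative and are not taken from the study.

```python
import math

def critical_ratio(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Critical ratio for the difference between two independent means."""
    se_a = sd_a / math.sqrt(n_a)             # standard error of mean A
    se_b = sd_b / math.sqrt(n_b)             # standard error of mean B
    se_diff = math.sqrt(se_a**2 + se_b**2)   # assumes uncorrelated groups
    return (mean_a - mean_b) / se_diff

# Made-up example: two groups of 100 with equal spread.
cr = critical_ratio(10.0, 5.0, 100, 8.0, 5.0, 100)
```

Under a normal approximation, a ratio of about 3.29 or more in absolute value corresponds to the .001 level for a two-tailed test.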

To evaluate the differences in the discrimina- 
tion produced by the two different item forms, 
a statistic which would take into account 
both central tendency and spread was employed. 
The measure, termed the proportion of overlapping, 
was defined as an area common to two distributions. 
This was taken 
as the proportion of scores of one group 
falling in the region between the tail of the 
distribution and an ordinate raised at half the 
sigma distance between the means of the two 
distributions. This statistic involves the 
assumption that the two distributions of 
scores are normal, and that the scores are 
obtained with the same measuring device. 

Proportions of overlapping were calculated 
for each discrimination made with each form 
of the OAPB as follows. The difference 
between each pair of raw mean scores (for the 
same discrimination) was divided by twice 
the average standard deviation of the two 
distributions, thus locating an ordinate half- 
way between the means. The standard score 
value for this cutting point ordinate was 
converted to a raw score value for the distribu- 
tion with the larger N. In the cases where 
this distribution had the lower raw mean value, 
all scores between the highest and the cutting 
point were tallied. In the cases where the 
distribution chosen for computation possessed 
the higher mean value, the scores between the 
lowest score and the cutting point were tallied. 


Table 2 

Differences Between Proportions of Overlapping for 
Two Forms of the OAPB, on Four 
Interest Comparisons 

                                   Proportion of Overlap 
                                   Form      Form      Diff. in 
Groups Contrasted                  FE        OE        Overlap (1) 
Educators-in-General, 
  Electrical Engineers             .15       .14       .01 
Administrators, Teachers 
  (Administrator Scale)                      .20       .03 
Administrators, Teachers 
  (Teacher Scale) 
Administrators, Teachers 
  (Administrator-Teacher Scale) 

(1) Positive differences are in favor of Form OE, 
that is, Form OE provides the smaller proportion of 
overlapping. 

(2) Standard errors of the differences computed by 
using McNemar’s formula 28a (6) which takes into 
account the correlational factor due to use of the same 
subjects for both test forms. 

(3) None of the differences is significant. 


Table 3 

Product-Moment Reliabilities for Four Occupational 
Interest Scales of Form FE and 
Form OE, OAPB* 

Scale          r       r, corrected** 
Form FE 
Ed-Eng 
Adm 
Tea 
Ad-Tea 
Form OE 
Ed-Eng                 .86 
Adm                    .60 
Tea            .32     .48 
Ad-Tea                 .67 

* Calculated from scores of 171 education students 
not in criterion groups. 

** Corrected by Spearman-Brown prophecy formula 
for test length. 


The tallies were each converted into proportions 
of the chosen distribution, the proportions of 
overlapping. Actually, if the two distributions 
were normal, with equal sigmas, the true 
proportion of overlapping scores would be 
equal to twice the proportion of overlapping. 
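
The tallying procedure for the proportion of overlapping can be sketched as follows. The sketch follows the steps as described, with two assumptions of its own: population standard deviations are used, and the cutting point is converted to a raw score with the standard deviation of the chosen (larger) distribution. The function name and the sample scores are hypothetical.

```python
from statistics import mean, pstdev

def proportion_of_overlap(scores_a, scores_b):
    """Proportion of the larger group's scores lying beyond the cutting point."""
    # Work with the distribution that has the larger N.
    if len(scores_a) >= len(scores_b):
        chosen, other = scores_a, scores_b
    else:
        chosen, other = scores_b, scores_a
    m_c, m_o = mean(chosen), mean(other)
    sd_c, sd_o = pstdev(chosen), pstdev(other)
    avg_sd = (sd_c + sd_o) / 2
    # Difference between means divided by twice the average sigma:
    # the standard score of an ordinate halfway between the means.
    z_cut = abs(m_c - m_o) / (2 * avg_sd)
    if m_c < m_o:
        cut = m_c + z_cut * sd_c                  # cutting point above the mean
        tally = sum(1 for s in chosen if s > cut)  # cut up to highest score
    else:
        cut = m_c - z_cut * sd_c                  # cutting point below the mean
        tally = sum(1 for s in chosen if s < cut)  # lowest score up to cut
    return tally / len(chosen)

# Made-up scores for two groups on one scale.
group_low = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
group_high = [6, 7, 8, 9, 10, 11, 12, 13, 14]
overlap = proportion_of_overlap(group_low, group_high)
```

A smaller proportion means better separation of the two groups on that scale.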

Differences between the proportions of 
overlapping obtained for the two different 
forms of the OAPB were computed, and 
standard errors of the differences obtained 
(McNemar’s formula 28a (6) was used). 
These data are presented in Table 2. 

The reliabilities for the scales developed 
for both forms of the OAPB were obtained. 
These are presented in Table 3, and it may 
be noted that they are quite comparable from 
form to form, with one exception. The value 
for the TEACHER scale for Form FE is 
considerably lower than that for the L-I-D 
form, Form OE. This may have been due to 
the small size of the criterion group used for 
securing the scale weights (N=41) and to a 
scale with comparatively few weights (33). 
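
The Spearman-Brown correction mentioned in the note to Table 3 can be sketched as below. The lengthening factor `k = 2` is an assumption. The note says only that the correction was "for test length", but the Teacher scale entries of .32 and .48 are consistent with doubling.

```python
def spearman_brown(r, k=2.0):
    """Predicted reliability of a test lengthened k times (Spearman-Brown)."""
    return k * r / (1 + (k - 1) * r)

# With k = 2 (an assumed factor), an uncorrected value of .32 becomes about .48.
corrected_tea = spearman_brown(0.32)
```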

Discussion 

The results of this investigation were 
clear-cut. For each comparison, from the 
discrimination of the interests of electrical 
engineering students from those of education 
students-in-general to the separation of the 
interests of teachers from those of education 
students-in-general, the two inventories used 
performed in an almost identical manner. 

Interpretation of the results, however, is 
dependent upon a number of contingent 
factors. There are two classes of these, the 
first being those limitations imposed by the 
experimental design, and the second being 
differences inherent in the item forms used. 

The experiment was restricted to professional 
people, who were presumably not motivated 
to mislead the experimenter or to fake their 
scores. The interest inventories were con- 
structed to be understandable to the subjects, 
so that they should not have had any tendency 
to respond in a manner unrelated to what the 
inventories were intended to measure. It is 
not known what would have occurred had the 
blanks been ambiguous, or too difficult for 
the respondents, or had the situation been 
one to induce faking. In those cases the 
discriminations obtained with the two item 
forms might have been quite different. 

Another limitation in the experimental 
design was the arbitrary method of selecting 
the statements to be linked in the forced-choice 
test items, and the use of pairs as the units 
of comparison. Forced-choice items may be 
constructed with more than two alternatives, 
and Kuder (5) states that his triadic items are 
as reliable as paired-comparison items. 

Those differences due to the item forms can 
be evaluated quantitatively, and must be 
considered in interpreting the research results. 
It has been already indicated in the section 
on interest item arrangement that each L-I-D 
type statement could provide a maximum of 
three response positions to be weighted when 
two groups are compared. Forced-choice 
items using paired-comparisons can provide at 
most only one weight for each two statements. 
The items used in the forced-choice Form FE 
are twice the length of the L-I-D items in 
Form OE of the OAPB. If all the A-B items 
on a forced-choice form were weighted on a 
given interest scale, the form would require 
twice the administration time that an L-I-D 
form with the same number of items would 
need. If the L-I-D form were weighted in all 
possible positions, the A-B form would provide 
only one-third the number of weights that the 
open-ended form would yield. Thus, an L-I-D 
form could conceivably be one-sixth the length 
of a paired-comparison test form, and yield the 
same number of weights (no consideration is 
given here to the relative sizes of the weights; 
in this study the L-I-D form provided a greater 
range of weights than the forced-choice form). 
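
The one-sixth figure follows from simple arithmetic, sketched here under the limiting assumptions of the paragraph above: a forced-choice item is two statements long and yields one weight, while an L-I-D item is one statement long and yields at most three weights. The function name and the target of 60 weights are illustrative.

```python
from fractions import Fraction

def length_for_weights(n_weights, statements_per_item, weights_per_item):
    """Form length, in statement units, needed to yield n_weights weights."""
    items_needed = Fraction(n_weights, weights_per_item)
    return items_needed * statements_per_item

# Length needed for the same number of weights on each form.
forced_choice_length = length_for_weights(60, statements_per_item=2, weights_per_item=1)
lid_length = length_for_weights(60, statements_per_item=1, weights_per_item=3)
ratio = lid_length / forced_choice_length  # L-I-D length relative to forced-choice
```

The ratio does not depend on the number of weights chosen, since both lengths scale with it.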

Since each statement represented in Form 
OE appeared three times in Form FE, Form 
OE was physically about 45 per cent the 
length of the other. Form OE also took only 
half the time to administer, yet produced the 
same total discrimination in terms of over- 
lapping of interests and scales of the same 
reliability. 


Summary and Conclusions 


An important problem in interest measure- 
ment concerns the relative effectiveness of 
different item response arrangements in dis- 
criminating among the interests of professional 
groups. 

In this study an interest inventory was 
designed in two comparable forms, one using 
L-I-D items and the other using forced-choice 
paired-comparisons, to discriminate between 
professional groups, and the relative merits of 
the two forms were assessed. 

Based upon the resultant discrimination per 
unit item length and unit time required for 
administration of the two forms, it is con- 
cluded that L-I-D test item arrangement in 
this study is clearly superior to forced-choice. 
Cronbach’s criticism of this item type seems 
not well-founded, in terms of its performance 
in discriminating the interests of professional 
groups. The hypothesis which was offered 
about the superiority of the forced-choice item 
form for discriminating between subgroups 
within a single profession was not upheld. 

The study was limited to professional 
persons who were not motivated to fake and 
who presumably understood the item contents. 
Also, the items were selected in accordance 
with a functional scheme which imposed its 
limitations on the results. In addition, only 
pairs of alternatives were used in making up 
the forced-choice items. It is not known what 
would have occurred had triadic items been 
used in the forced-choice form. Further 
investigation is necessary to secure information 
on these points. 


Received May 25, 1951. 







References 


1. Cronbach, L. J. Response sets and test validity. 
Educ. Psychol. Measmt., 1946, 6, 475-493. 

2. Cronbach, L. J. Further evidence on response sets 
and test design. Educ. Psychol. Measmt., 1950, 
10, 3-31. 

3. Fehrer, Elizabeth, and Strupp, H. The effect of 
equating interest test items for prestige value. 
J. appl. Psychol., 1949, 33, 222-230. 

4. Guilford, J. P. A simple scoring weight for test 
items and its reliability. Psychometrika, 1941, 
9, 67-81. 

5. Kuder, G. F. Examiner manual for the Kuder 
Preference Record—Vocational. Chicago: Science 
Research Associates, 1949. 

6. McNemar, Q. Psychological statistics. New York: 
Wiley, 1949. 

7. Strong, E. K., Jr. Vocational interests of men and 
women. Stanford: Stanford University Press, 
1943. 

8. U. S. Employment Service. Dictionary of Occupational 
Titles, Part I. Washington, D. C.: U. S. 
Government Printing Office, 1939. 

9. U. S. Office of Education. Biennial Review of 
Education. Washington, D. C.: U. S. Government 
Printing Office, 1946. 








Communication, Supervision, and Morale 


C. G. Browne 
Wayne University 
and 
Betty J. Neitzel 
National Bank of Detroit 


This study was concerned with the estima- 
tion and communication of responsibility, 
authority, and delegation of authority by 
three supervisory levels of female employees 
in a utilities company. Comparisons will be 
made between the communication of the 
three factors and the attitudes of the super- 
visory employees toward company personnel 
policies. 

While in many cases management may 
believe that it has established specific respon- 
sibilities and authorities for given positions, 
and that they are co-equal, it is important to 
determine whether or not all levels of the 
organization have communicated the thinking 
of management in an understandable and 
acceptable manner. Communication is a proc- 
ess that takes place throughout the entire 
organization between all individuals and 
departments, in a flow both inwardly and 
outwardly through all echelons.¹ 


Procedure 

The subjects for this study were a group of 
female employees at three supervisory levels 
selected from eight offices of a Michigan 
utilities company. The three supervisory 
levels will be designated A, B, and C for 
purposes of this report. Level A was the 
inner level of the three and functioned in a 
supervisory capacity to level B; level B 
supervised level C; and level C supervised a 
non-supervisory level not included in the 
study. An office from two districts of each 
of the four divisions of the company was 
included. Districts 1 and 2 represented 
Division I; Districts 3 and 4, Division II; 
Districts 5 and 6, Division III; and Districts 
7 and 8, Division IV. 


¹ For an explanation of inner and outer as contrasted 
with upper and lower management levels, see Browne 


The R, A, and D Scales² developed by 
Stogdill and Shartle (8) were used to obtain 
estimates of responsibility, authority, and 
delegation of authority. The method used in 
constructing the scales has been described by 
Browne (2). To measure employee attitudes, 
the morale scale devised by Harris (5) was 
used. This scale consists of 36 items, each 
having a discrimination value of 1.0 or higher. 
Five items which were not applicable to the 
utilities company were eliminated, leaving 31 
items which were used in this study and which 
provided a maximum score of 45.60 (the sum 
of the discrimination values of the 31 items). 
A total of 117 sets of forms were mailed to 
the divisions. Of these, 100 sets or 86 per cent 
were completed and returned directly to the 
authors. The completed forms included 8 
level A supervisors; 26 level B supervisors; 
and 66 level C supervisors. 


R, A, and D Scores 


The R, A, and D scores represent the 
person’s estimates of her responsibility, author- 
ity, and delegation of authority. The mean 
R, A, and D scores for each of the three 
supervisory levels are given in Table 1. 

Since the lower scores indicate a higher degree 
of the factor measured, the mean scores of 
3.61, 3.85, and 4.64 for the level A supervisors 
represent the highest estimates for R, A, and 
D, respectively. The mean scores of the level 
B supervisors represent the next highest, 
while the mean scores for level C supervisors 
represent the lowest.³ The trend of the mean 
scores indicates that the subjects estimated 
the degree of their responsibility, authority, 
and delegation of authority in relation to their 
position in the company. That is, the closer 
the supervisory level of a group was to the 
focal point (3) of the organization, the higher 
the estimates of each of the factors were. 
This trend also was supported when the data 
were studied by divisions, districts, and 
individuals. 

² Persons interested in information regarding the R, 
A, and D Scales may contact Dr. Ralph M. Stogdill, 
Associate Director, Personnel Research Board, The 
Ohio State University, Columbus 10, Ohio. 

³ For clarity, the quantitative interpretation of R, A, 
and D scores will not be used in the following discussion. 
Instead, a qualitative interpretation will be used, so 
that a discussion of a high R score, for example, will 
represent an estimate of a high degree of R and a 
discussion of a low R score will represent an estimate of a 
low degree. 

Table 1 

Mean R, A, and D Scores* 

Supervisory            Mean      Mean      Mean 
Level            N     R Score   A Score   D Score 
Level A          8     3.61      3.85      4.64 
Level B         26     3.82      4.52      5.05 
Level C         66     3.87      4.81      5.54 
Total Group    100     3.83      4.66      5.34 

* The range of possible scores on each scale is 1.0 to 
8.7. It is important to note that the lower quantitative 
scores indicate a higher degree of the factor measured, 
while the higher quantitative scores indicate a lower 
degree. 

For the total group and for each supervisory 
level, the figures in Table 1 also indicate that 
responsibility was estimated to be the greatest 
of the three factors, followed by authority and 
delegation of authority, as evidenced in the 
total group mean scores of 3.83, 4.66, and 5.34, 
respectively. With the exception of one 
district, this was consistently true when the 
data were analyzed by divisions and districts. 
The range of the individual scores was from 
2.72 to 6.78 for R; 2.82 to 6.98 for A; and 2.90 to 
7.55 for D. The mode for the R scores was 4.0; 
for the A scores, 4.6; and for the D scores, 5.5. 
Here again, the same relationship is observed. 
For the individuals, 10 of the 86 cases estimated 
authority to be greater than responsibility, 
but the remaining 76 followed the trend of the 
total data in estimating responsibility to be 
greater than authority. 

The data, then, demonstrate that these 
supervisors did not estimate their responsibility 
and authority to be equal, as might ideally be 
expected. The product moment inter-correlations 
between the three factors were: R and 
A=.24; R and D=-.03; A and D=.22. 
These coefficients indicate some tendency for 
persons with high responsibility estimates to 
estimate authority high also, and for those with 
high authority estimates to estimate greater 
delegation of authority. However, the rela- 
tionships were not as high as reported in two 
previous studies. In the Ohio State Leader- 
ship Studies, unpublished correlations for a 
group of Naval officers were found to be .56 
for R and A; .16 for R and D; and .86 for A 
and D. Browne (2) in his study of business 
executives reported correlations of .56 for 
R and A; .29 for R and D; and .54 for A and D. 
It will be noted, however, that the correlations 
in the three studies indicate the same general 
trend since the correlations between R and A 
and between A and D were larger throughout 
than the correlation between R and D. The 
variation in the size of the coefficients may be 
regarded as a function of the variation in the 
groups and the situations in which they were 
operating. 

If R and A were judged to be equal or if 
each person estimated them in the same 
proportionate relationship, the correlation 
between them would be 1.00. The extent to 
which the relationship deviates from this 
perfect, ideal relationship may be dependent 
upon two general variables: (1) the effective- 
ness of communication between supervisory 
levels; or (2) the clearness and specificity with 
which management has defined responsibility 
and authority for each supervisory level in the 
organizational set-up. Considering the com- 
parative figures given above, it would appear 
that these variables as represented by R and A 
scores were more satisfactorily understood and 
communicated in the military situation and 
in inner management levels than they were in 
the present situation which studied outer 
management on the first, second, and third 
levels of supervision. 
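
The statement that equal or strictly proportional estimates of R and A would correlate 1.00 can be checked numerically. The following is a minimal sketch with made-up scores. The helper `pearson_r` and the data are illustrative and are not taken from the study.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical R scores for five supervisors.
r_scores = [3.0, 3.5, 4.0, 4.5, 5.0]
a_equal = list(r_scores)                        # A judged equal to R
a_proportional = [1.2 * r for r in r_scores]    # A in constant proportion to R

r_equal = pearson_r(r_scores, a_equal)
r_proportional = pearson_r(r_scores, a_proportional)
```

Both cases give a coefficient of unity, so any departure from 1.00 reflects individuals relating the two factors differently from one another.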

A correlation coefficient of unity between A 
and D would indicate that all persons estimated 
their delegation of authority equally in re- 
lationship to their estimates of authority. 
Obviously this perfect relationship need not 
be expected. In fact, it might indicate an 
undesirable condition within the organization. 
However, the extent to which individuals 
believe they are delegating authority may be 





88 C. G. Browne and Betty J. Neitzel 


studied from the size of the correlation. In 
each situation further research would be 
needed to determine what the most desirable 
relationship should be. 

There is no reason to believe that the individ- 
ual’s estimate of responsibility should be 
related to his delegation of authority. Theo- 
retically at least, while authority can be 
delegated, responsibility cannot, since an 
individual is always responsible to inner 
management levels for the responsibilities 
which have been assigned to him. The 
correlation, then, between R and D has little 
working meaning, although the lack of any 
necessary relationship between these two 
variables is supported in all three of the studies 
reported since the R and D correlation in 
each case is the lowest of the three. 

R, A, and D Disparity Scores 

In order to study the effectiveness of the 
communication of responsibility, authority, 
and delegation of authority, some measure of 
communication of these factors between the 
three supervisory levels was needed. For this 
purpose, disparity scores were used which 
represented the differences between the in- 
dividual’s estimations of R, A, and D for 
herself and the estimates of her supervisor or 
assistants, as appropriate, of the three factors 
for the individual. 

Since responsibility and authority flow from 
supervisor to assistants, supervisors in levels 
A and B completed scales to estimate the 
responsibility and authority of their immediate 
juniors, who were levels B and C, respectively. 
The scores of these scales were designated “r” 
and “a” scores. Since an individual’s authority 
is delegated to her by her supervisor, 
subjects in levels B and C completed scales to 
estimate the delegation of authority of their 
immediate seniors, who were levels A and B, 
respectively. The scores of these scales were 
designated “d” scores. 

As an example, the R disparity score, then, 
for a level B supervisor is the difference 
between the R score of the level B supervisor 
and the “r” score of her level A supervisor. 
Thus, the R disparity score represents the 
difference between the level B supervisor’s 
thinking regarding her responsibility and the 
thinking of her supervisor. In this way, 


the R disparity score serves as a measure of 
the communication of responsibility between 
adjoining levels of supervision.⁴ On the same 
basis, the A disparity score for a level B 
supervisor is the difference between her A 
score and the “a” score of her level A super- 
visor. In this study, the disparity score was 
used without consideration of the algebraic 
sign. However, interest in another study well 
may be in the direction of the difference, and 
in this case the algebraic sign may be used. 

Whereas R and A disparity scores can have 
only one value since they depend on the 
estimations of only two individuals, the D 
disparity score of an individual may have as 
many values as she has people under her 
supervision. For the purposes of this study, 
a composite disparity score was used, calculated 
in the following manner. In the case of a 
level B supervisor, the difference between 
her D score and the “d” score of each of her 
level C assistants was determined. The mean 
of these differences is the D disparity score 
for the level B supervisor. 
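
The disparity scores described above can be sketched as follows. The R and A disparities are taken without regard to sign, as the text states. For the composite D disparity, the sketch assumes that each difference is likewise taken without sign before averaging, which the text does not state explicitly. Function names and scores are hypothetical.

```python
from statistics import mean

def disparity(self_score, other_score):
    """R or A disparity: unsigned difference between a supervisor's own
    estimate and her senior's estimate ("r" or "a") for her."""
    return abs(self_score - other_score)

def d_disparity(self_d, assistants_d):
    """Composite D disparity: mean unsigned difference between a supervisor's
    own D score and the "d" score given by each of her assistants."""
    return mean([abs(self_d - d) for d in assistants_d])

# Made-up scores for one level B supervisor.
r_disp = disparity(3.8, 4.5)                 # her R score vs. her senior's "r"
d_disp = d_disparity(5.0, [4.5, 5.5, 6.0])   # her D score vs. three "d" scores
```

A score of zero would indicate complete agreement between the two levels on that factor.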

The mean R disparity, A disparity, and D 
disparity scores for the total group were .81, 
.77, and .36, respectively, while the medians 
were .72, .52, and .61. The range was 0.00 
to 2.52 for R disparity; 0.00 to 2.83 for A 
disparity; and .22 to 2.73 for D disparity. 
The medians give a more accurate picture of 
the results since the distribution had some 
extreme scores and did not yield a normal 
distribution. 

Although the differences in median disparity 
scores were not great, they indicate a tendency 
for R disparity scores to be greatest, followed 
by D disparity and then A disparity, but the 
A and D disparities are reversed in order when 
the means are considered. In no case did the 
three R, A, and D scores of any person agree 
with the three “r,” “a,” and “d” scores 
obtained from her supervisor and assistants. 
Since disparity scores are a means of stating 
quantitatively the extent of disagreement 
between a person’s estimate of the factor 
measured and the estimate of her supervisor 
or assistant of the same factor for the same 
individual, they constitute a measure of 
communication between the individuals. If 
communication between supervisory levels is 
complete, the responsibility and authority an 
individual believes he has should agree with 
his supervisor’s estimates of his responsibility 
and authority, and the individual’s estimate 
of authority delegated to his assistants should 
agree with the assistants’ estimate of what 
the senior has delegated. Differences in these 
agreements are revealed by disparity scores, 
the size of the score being a function of the 
difference in thinking between supervisory 
levels. 

⁴ It should be noted that disparity scores as described 
here can be used only between adjoining levels of 
supervision. 

Correlation coefficients were obtained between 
disparity scores and Harris morale scores, and 
between the deviation of the individual R, A, 
and D scores from the mean R, A, and D 
scores of individuals in the same job. The R 
mean deviation score will be used to illustrate 
the methods of obtaining the deviation scores. 
The mean R scores of level A, level B, and 
level C supervisors were determined. The 
R mean deviation score of each level A 
supervisor is the difference between her 
individual R score and the mean R score of 
all level A supervisors. The same procedure 
was followed for the other supervisory levels 
for R mean deviation and for A and D mean 
deviation scores for each supervisory level. 
Thus, the mean deviation scores are measures 
of the extent to which the individual’s esti- 
mations of R, A, and D in her own specific 
position are at variance with the mean 
estimation of R, A, and D of all individuals 
included in the study doing her particular job. 
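
The mean deviation score can be sketched as below. The sketch assumes the deviation is taken without regard to sign, in line with the treatment of disparity scores, although the text does not say so. The function name and the records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_deviation_scores(records):
    """For (level, score) pairs, return each person's unsigned deviation
    from the mean score of her own supervisory level, in input order."""
    by_level = defaultdict(list)
    for level, score in records:
        by_level[level].append(score)
    level_means = {level: mean(scores) for level, scores in by_level.items()}
    return [abs(score - level_means[level]) for level, score in records]

# Made-up R scores for two level A and three level B supervisors.
records = [("A", 3.0), ("A", 4.0), ("B", 5.0), ("B", 5.5), ("B", 6.0)]
deviations = mean_deviation_scores(records)
```

The same function applies unchanged to A and D scores.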

In Table 2, the correlations between dis- 
parity scores, morale scores and the R, A, and 
D mean deviation scores are given. The 
coefficient of .56 between R mean deviation and 
R disparity and of .63 between D mean 
deviation and D disparity represent substantial 
relationships between these two variables. 
Although the correlation of .31 between A 
mean deviation and A disparity is smaller, it 
indicates the same trend. Thus, in all of 
these correlations, the indication is that those 
individuals who deviated most in their estima- 
tions of the three factors from the estimates 
of their total job group (mean deviation score) 
also were the individuals who were at greatest 
variance with the estimates of their supervisors 
and assistants for the three factors 
related to their position (disparity score). 
For example, an individual estimate of responsibility 
that was higher or lower than the 
mean responsibility score of the echelon to 
which the individual belonged was likely 
also to be higher or lower than the estimate of 
her responsibility by her supervisor. 

Table 2 

Product Moment Correlations of R, A, and D Mean 
Deviations and Morale Scores with R, A, 
and D Disparity Scores 

                     R, A, and D 
Disparity       N    Mean Deviations   Morale 
R disparity    92        .56            -.54 
A disparity    92        .31            -.10 
D disparity    34        .63             .06 

The correlation of -.54 in Table 2 indicates 
that individuals with high morale scores 
tended to be in closer agreement with their 
supervisors regarding their level of respon- 
sibility since this would make for low disparity 
scores. If it is accepted that the disparity 
score is a measure of communication between 
supervisory levels, then the present evidence 
regarding morale would support the concept 
that communication is one of the influencing 
factors in the determination of morale, 
particularly as related to the responsibility 
variable. 


Morale Scores 


Each morale score represents the attitude 
of an individual toward company personnel 
policies. The maximum score of 45.60 was 
obtained by one level A supervisor and three 
level B supervisors. The lowest score for 
the group was 17.95 for a level C supervisor. 
In six of the eight districts, the level A super- 
visor had the highest morale score, and the 
mean of the level B supervisor scores was 
higher than the mean of the level C supervisor 
scores. In two districts, both in the same divi- 
sion of the company, the level B supervisors had 
the highest mean score followed by the level C 
supervisors, and the level A supervisor score was 
the lowest. Generally, however, the morale 
score was positively related to the echelon level 
of the supervisors, the inner level supervisors 
having the highest scores. It may be noted 
that R, A, and D scores also were positively 
related to echelon level. 

Table 3 

Product Moment Correlations of Morale Scores 
with R, A, and D Scores* 

Morale Score     N     R      A      D 
Level A**        8                   .13 
Level B         26           -.39    .08 
Level C         66    .16    .07     .10 
Total Group    100    .05    .08     .09 

* The sign for these correlations has been changed so 
that in interpreting the correlations a large score in one 
variable is also indicative of a large score in or a greater 
degree of the second variable. 

** The correlations for level A were computed by the 
rank-difference method. The coefficients obtained were 
converted into their equivalent Pearson r coefficients. 

Table 3 includes the correlations between 
morale scores and R, A, and D scores for the 
three supervisory levels and for the total 
group. For the total group there is little 
relationship between the variables as indicated 
by the correlations of .05, .08, and .09 for 
R, A, and D, respectively. However, for 
R and A correlated with morale scores for 
the inner levels A and B supervisors, the 
range of correlation coefficients was -.16 to 
-.47. There appears, then, to be a definite 
trend in the inner supervisory levels, partic- 
ularly for those who estimated responsibility 
and authority higher, to have lower morale 
scores. This was not the case, however, 
with the outer level C supervisors, there being 
little relationship between the variables for 
them as reflected in the coefficients of .16, 
.07, and .10. 
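
The note to Table 3 states that the level A correlations were computed by the rank-difference method and converted to equivalent Pearson r values. A minimal sketch of both steps follows. The conversion r = 2 sin(πρ/6) is the classical relation for normally distributed variables and is assumed here, since the source does not say which conversion was used. The ranking helper assumes no tied scores, and the data are made up.

```python
import math

def rank_difference_rho(x, y):
    """Rank-difference (Spearman) coefficient; assumes no tied scores."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def rho_to_pearson(rho):
    """Assumed conversion from rho to an equivalent Pearson r."""
    return 2 * math.sin(math.pi * rho / 6)

# Eight hypothetical supervisors, matching the N of level A.
morale = [1, 2, 3, 4, 5, 6, 7, 8]
r_estimates = [2, 1, 4, 3, 6, 5, 8, 7]
rho = rank_difference_rho(morale, r_estimates)
r_equivalent = rho_to_pearson(rho)
```

With only eight cases, the converted value differs little from rho itself, and either coefficient carries a wide sampling error.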


Summary 


This study was an investigation of the 
communication of responsibility, authority, 
and delegation of authority at three super- 
visory levels of a utilities company and included 
a study of employee morale in relation to the 
three factors. The R, A, and D Scales 
developed by Stogdill and Shartle and the 
Harris morale scale were used as measuring 
instruments. As one measure of communication, 
a disparity score was used which represented 
the differences between the individual’s 
estimates of R, A, and D for herself and the 
estimates of her supervisor in the case of 
R and A and the estimates of her assistants 
for D. 

The results of the investigation included the 
following: 


1. Individuals estimated their responsibility, 
authority, and delegation of authority in 
relation to their position in the company, those 
nearer the focal point of the organization 
having higher scores on all three variables. 

2. Responsibility and authority were not 
estimated to be equal, but most subjects 
believed their responsibility exceeded their 
authority. 

3. Disparity scores (the differences between 
the individual’s estimates of R, A, and D for 
herself and the estimates of her supervisor or 
assistants, as appropriate, of the three factors 
for the individual) produced no cases of 
agreement between individuals on varying 
levels of supervision, the amount of disparity 
being a measure of incomplete or unsatisfactory 
communication. 

4. There was a negative relationship between 
morale scores and disparity scores, this being 
particularly evidenced with R disparity scores 
which correlated —.54 with morale scores. 

5. Correlations of .56, .31, and .63 were 
obtained between the deviation of individual 
R, A, and D scores from the mean score of 
each supervisory level group and disparity 
scores for the three variables. 

6. Morale scores were found to be positively
related to the echelon level of the supervisors,
the inner level supervisors generally having the
highest scores. In the inner supervisory levels 
there was a trend as indicated in correlation 
coefficients ranging from —.16 to —.47, for 
those who estimated responsibility and author- 
ity higher to have lower morale scores. 


Received May 14, 1951. 


References 

1. Barnard, C. I. The functions of the executive. Cam-
bridge: Harvard University Press, 1947.

2. Browne, C. G. Study of executive leadership in
business. I. The R, A, and D Scales. J. appl.
Psychol., 1949, 33, 521-526.

3. Browne, C. G. The concentric organization chart.
J. appl. Psychol., 1950, 34, 375-377.







4. Guilford, J. P. Fundamental statistics in psychology
and education. New York: McGraw-Hill, 1942.

5. Harris, F. J. The quantification of an industrial
employee survey. J. appl. Psychol., 1949, 33,
103-111.

6. Jucius, M. Personnel management. Chicago: Rich-
ard D. Irwin, Inc., 1947.

7. Stogdill, R. M. Leadership, membership and organi-
zation. Psychol. Bull., 1950, 47, 1-14.

8. Stogdill, R. M., and Shartle, C. L. Methods for
determining patterns of leadership in relation to
organization structure and objectives. J. appl.
Psychol., 1948, 32, 286-291.





Opinions on Communism of Air Force Police Trainees 


Major Norman E. Green 


Air University, Human Resources Research Institute 


In the summer of 1950 the security con- 
sciousness of the United States Air Force 
rose to a new high. Increased international 
tensions and the results of security vulnerabil- 
ity surveys at certain air bases showed that 
a vital need existed for greater protection 
against subversive and sabotage activity. 
This was particularly important at fighter 
interceptor bases where the USAF is charged 
with first line defense of the nation against 
hostile activity in the air. It was equally 
important at long-range bomber bases where 
our aircraft and crews must be instantly ready 
should a retaliatory air strike become necessary. 
The lack of an airtight security plan might 
feasibly result in a crippling blow before any 
effective reaction could be made. Part of 
the answer to this need was seen as the prompt 
training of an increased force of Air Policemen. 
A school for this purpose was established at 
Tyndall Air Force Base, Florida, and in 
September 1950 its doors were opened for the 
young airmen students. 


The Problem Situation 


The course centered around anti-sabotage 
measures and the development of proficiency 
in weapons and unarmed combat. Through- 
out the period of schooling, instruction of a 
motivational and informational nature was 
also presented. This included such subjects 
as career opportunities in the Air Police 
System and discussions of communism as a 
threat to the American way of life and to 
the security of the USAF. The psychological 
preparation of the airman for his new duty 
was not neglected nor made secondary to the 
physical preparation. This, of course, was in 
line with policies on troop information in 
general and was commensurate with the 
now-common knowledge that the best informed 
airmen are characterized by higher morale and 
efficiency. 

Three 45-minute periods were allotted for 
instruction on communism. The rationale 
behind this instruction was to present facts 


about communist activity as it is operating 
in the world today and not to dwell upon
discussions of political philosophy and what 
might be or could be. 

The first period served as an introduction 
during which the instructor described the 
threat of communist sabotage at vital Air 
Force bases and showed a film depicting the 
origin and growth of communism, its patterns 
of aggression and its subversive methods. 

The period for the second week was called 
“Communism in the United States.” With 
two instructors taking part, it was presented 
in question and answer form with several of 
the questions and comments coming from the 
students themselves. The following are typical 
of the questions discussed: What is communism? 
Has any nation ever gone communist in a free 
election? How do communists try to get 
control? Under communist rule: could I 
belong to a union; could I go to school; could 
I change my job; could I travel around the 
country as I please; could I teach what I 
want with “academic freedom”? How many 
communists are there in the United States? 
Where are their headquarters in the United 
States? What is the communist party set-up 
in the United States? What does one have 
to do to join the communist party? How do 
communists get control of organizations in 
which the majority are not communists? 

During the third week the discussion method 
was used in a similar manner for the subject 
of “Communism and Religion.” The follow- 
ing questions are typical of those considered: 
If communism should come to the U. S. could 
I belong to a church? What would the 
communists do to the churches and synagogues? 
What is the communist faith? Do the 
communists pretend to tolerate religion today? 
How would my child learn his religion? Who 
would own the churches? What is the
“Peoples Institute of Applied Religion”?
How are priests and ministers treated under 
communist rule? 







Throughout these three class periods empha- 
sis was put on the close tie-in between the 
success of the Air Police mission and the 
success of the whole Air Force mission. 


The Problem of Attitudes 


To obtain information on the depth and 
direction of the Air Police trainees’ attitudes 
toward communism was considered important 
for three reasons: (1) The results of such an 
inquiry would serve as an appraisal of certain 
learning outcomes; (2) these results would also 
provide data on the modifiability of attitudes 
in a school situation; and (3) some insight 
about the quality of the airmen’s psychological 
preparation for their responsibilities would 
be gained. Accordingly, the investigation of 
attitudes was undertaken. 


The Population 


The population used for the study included 
four classes of airmen Air Police students 
totaling 1,974 subjects. These airmen had 


been sent direct to Tyndall Air Force Base 
from the Indoctrination Wing at Lackland 
Air Force Base, Texas, where they had taken 
their basic recruit training during a stay of 


approximately four weeks. The new airmen, 
most of whom were high-school graduates, had 
voluntarily enlisted and had come from homes 
all over the country. Two classes totaling 
992 incoming students were used as the control 
group and two classes comprising 982 outgoing 
students served as the experimental group. 
Original selection procedures and qualification 
requirements (physical examinations and 
minimum AGCT score of 90) for Air Force 
service were the same for all subjects. All 
airmen were assigned to this training to fill 
the immediate need described above. The 
same instruction on communism and all other 
matters was given to all airmen in the popula- 
tion. In addition, as shown in Table 1, the 
control group and the experimental group can 
be considered alike in age and amount of 
formal schooling. 


Procedure 


The ten statements were composed by the
writer to serve as the opinion yardstick.
They were made purposefully strong in tone
to provide opportunity for indication of
intensity of opinion. As such, the statements
do not reflect the tone of the instruction
presented. For each item spaces were pro-
vided for expressing “no comment,” “strongly
disagree,” “disagree,” “agree,” and “strongly
agree.” Complete anonymity of respondents
was maintained throughout the study.

Table 1

Data on Age and Educational Achievement for Control
Group (N = 992) and Experimental Group (N = 982)

                                        Control        Experimental
                                        Group          Group
Mean Age                                20.0           19.8
Mean Years of School                    11.8           11.6
Per Cent Completed 7th,
  8th and 9th Grades Only                3.4            8.0
Per Cent Completed 10th
  and 11th Grades Only                  [illegible]    19.3
Per Cent Completed High
  School Only                           [illegible]    61.9
Per Cent with some
  College Work                          [illegible]*   10.8**

* Includes 3 college graduates.
** Includes 4 college graduates.

The control group of 992 incoming students 
completed the survey schedule on the morning 
of the first day of classes before any instruction 
was given. The experimental group of 982 
outgoing students completed the form during 
the third week of the course after the third 
and final period of instruction on communism. 
The responses of the two groups were tabulated 
separately and the data were organized to 
show comparisons of the two groups on a 
before and after basis. Thus, each statement 
was analyzed by showing percentages of 
agreement and disagreement giving, by in- 
ference, a picture of the shifts and changes 
in direction and intensity of opinion. The 
“before” and “after” differences in feelings 
about communism were then tested for 
statistical significance. 

Results 

The results of the analysis are presented 
below for each statement individually. The 
figures are expressed in per cent and critical 
ratios are given to show the statistical signif- 
icance of differences for each response category. 
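
The article does not state how the critical ratios were computed. A common form, assumed here, divides the difference between two independent percentages by the standard error obtained from the pooled proportion. The sketch below applies that form to the group sizes given above (992 and 982) and to a pair of percentages, 44 and 67, that is printed in the results with a critical ratio of 10.3. The function name is illustrative.

```python
import math

def critical_ratio(pct_control, pct_experimental,
                   n_control=992, n_experimental=982):
    """Difference between two percentages over its pooled standard error.

    This is the usual two-sample test for proportions; whether the author
    used exactly this form is an assumption.
    """
    p1 = pct_control / 100
    p2 = pct_experimental / 100
    pooled = (p1 * n_control + p2 * n_experimental) / (n_control + n_experimental)
    std_error = math.sqrt(pooled * (1 - pooled)
                          * (1 / n_control + 1 / n_experimental))
    return (p2 - p1) / std_error

print(round(critical_ratio(44, 67), 1))  # 10.3
```

Under this form a ratio of about 2.6 or more corresponds to significance at the .01 level on a two-tailed test.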







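1. Communism is a plan to rule the world.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  2.3          [illegible]  [illegible]  50
Experimental Group    [illegible]  .8           [illegible]  [illegible]  [illegible]
Critical Ratio        [illegible]  -2.8         [illegible]  [illegible]  [illegible]

2. Under communism Americans would have to obey orders from the bosses or be put in jail or shot.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  1.1          [illegible]  49           44
Experimental Group    [illegible]  .8           [illegible]  29           67
Critical Ratio        [illegible]  [illegible]  [illegible]  -9.5         10.3

3. A communist is the next thing to a gutter rat.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  [illegible]  [illegible]  [illegible]  36
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  60
Critical Ratio        [illegible]  [illegible]  [illegible]  [illegible]  [illegible]

4. Communists would disobey the Almighty rather than disobey their leader.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  66
Critical Ratio        [illegible]  [illegible]  [illegible]  [illegible]  9.3

5. Communists are eager to destroy all our airplanes.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  1.4          10           [illegible]  28
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  67
Critical Ratio        -11.8        -1.4         -7.8         [illegible]  17.5

6. A communist would give up his wife before he'd give up the party.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  [illegible]  [illegible]  [illegible]  28
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  41
Critical Ratio        [illegible]  [illegible]  [illegible]  [illegible]  6.0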







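7. Communists want to have more children to help build up the Red army.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  [illegible]  [illegible]  [illegible]  27
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  48
Critical Ratio        [illegible]  [illegible]  [illegible]  [illegible]  9.8

8. Communists would like to blow up every church they could.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Critical Ratio        [illegible]  [illegible]  [illegible]  [illegible]  [illegible]

9. The communists have plans to take over all of the United States.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  [illegible]  [illegible]  [illegible]  39
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  54
Critical Ratio        [illegible]  [illegible]  [illegible]  [illegible]  6.9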


10. A communist would wipe his feet on the American flag before he'd salute it.

                      No           Strongly                                Strongly
                      Comment      Disagree     Disagree     Agree        Agree
Control Group         [illegible]  [illegible]  [illegible]  [illegible]  42
Experimental Group    [illegible]  [illegible]  [illegible]  [illegible]  58
Critical Ratio        [illegible]  [illegible]  [illegible]  [illegible]  6.7


Discussion of Results


An examination and study of the data make
certain facts eminently clear and provide a
basis for further inferences.

On all ten statements about communism,
an overwhelming majority of this sample of
airmen were in agreement when they entered
the school. This agreement ranged from a
majority of 65 per cent on item 8 (Communists
would like to blow up every church they could)
to a majority of 93 per cent on item 2 (Under
communism Americans would have to obey
orders from the bosses or be put in jail or
shot), the average agreement being 77.5 per
cent.

Following the third period on communism,
the majority agreement had increased to a

range of from 75 per cent on item 6 (A com- 
munist would give up his wife before he’d 
give up the party) to a majority of 97 per cent 
on item 1 (Communism is a plan to rule the 
world), the average agreement now being
87.7 per cent. This change was reflected
principally but not entirely in a shift in 
intensity of opinion from “agree” to “strongly 
agree” and is statistically significant at the 
.01 level of confidence. 

Again considering all ten statements, the 
number of persons in the “no comment”
category was significantly reduced by 7 per 
cent, and those who originally strongly 
disagreed or disagreed were in a significantly 
smaller minority (2.8%) after the school 
experience. The pattern of these changes in 




direction and intensity of opinion was con- 
sistently positive and vectored to the categories 
of agree and strongly agree. 

It is somewhat revealing if not surprising 
to find such a consensus of anti-communist 
feeling within this population of American 
youth. However, this general attitude may 
be at least partially accounted for by the 
fact that the group had already been motivated 
in this direction to the point of voluntary 
enlistment in the service. Thus, the strength 
of the attitude is mirrored in the decision to 
“join up.” Whatever may be the true 
similarity between these opinions and those 
of the nation as a whole, the suggestion is 
inescapable that Americans may have much 
stronger feelings on this entire issue than their 
leaders imagine. 

This study supports other investigations 
showing that a planned program of information 


can result in definite shifts of attitudes. If it 
is desired that a population be better prepared 
psychologically to meet an aggressor, a simple 
plan for the communication of ideas will go 
a long way toward doing the job. 

From the standpoint of their preparation 
attitude-wise for their responsibility for security 
and protection within USAF installations, 
these airmen may be said to be well-equipped. 
This is true not only because the attitudes 
exist, but because they exist in strength and 
intensity. Attitudes are characterized by a 
behavior or action component. This might 
extend from a mere inclination to vote “yes” 
to a most forceful or even sacrificial reaction. 
The anti-communist feelings of these air 
policemen, therefore, imply a definite readiness 
to respond appropriately to certain persons and 
situations. 


Received May 3, 1951. 





Studies in Job Evaluation: 9. Validity of a Check List for
Evaluating Office Jobs *


Minnie Caddell Miles 
Occupational Research Center, Purdue University 


For a number of years increasing emphasis 
has been attached to the study of job evalua- 
tion. World War II gave an impetus to this 
movement. Almost overnight industrial 
managers all over the country became more 
keenly aware than ever before of the tremen- 
dous need for adequate methods for evaluating 
jobs. This need continues to exist today. 
Particularly is this true in the area of office 
jobs. In spite of the vast amount of material 
published on the progress of job evaluation 
programs, a relatively small amount pertains 
to the office. As late as 1945, according to 
Ells (5), few companies had classifications for 
office employees other than for a few clerical 
and stenographic jobs. In 1949 a survey by 
the National Office Management Association 
revealed that only 32 per cent of the companies 
reported office job evaluation plans (13). 

The difficulties of studying office jobs, 
because of the lack of standardization and the 
multiplicity of duties sometimes performed 
by a single job incumbent, are no doubt 
partially responsible for this fact. In addition, 
one is faced with intangibles which are difficult 
to measure; with numerous jobs which cannot 
be put on a measured production basis; plus 
the fact that few clerical jobs remain the 
same over a period of time (9). And, by no 
means a minor factor is the difficulty in 
determining dependable market rate com- 
parisons (2). However, these problems em- 
phasize all the more the need for careful 
study. 

A possible approach for overcoming many 
of these problems is that of a check-list. It is 
the purpose of this study to determine the 
validity of a Job Description Check-List for 
the evaluation of office jobs. The underlying 
assumption is that by using paired-comparison 

* This paper is based upon a thesis submitted to the
Graduate School of Purdue University in partial fulfill-
ment of the requirements for the degree of Doctor of
Philosophy, January, 1951 (10a). The research was
done under the direction of Dr. C. H. Lawshe.


job ratings as criteria, we can determine the 
validity of the check-list device. 


Background of Problem 


Two different groups have devoted research 
to the use of job elements in the study of 
office jobs. The first such research was by 
members of the Life Office Management 
Association, and the second was under the 
sponsorship of the Occupational Research 
Center, Purdue University. 

Job Elements as a Basis for Office Job 
Evaluation. The Life Office Management 
Association, having a fertile field for research 
among the thousands of office employees in 
the life insurance business, has for a number 
of years carried on a continuing project in the 
evaluation of office jobs. As a result, the 
Clerical Salary Study Committee of this 
organization developed, or was responsible for 
the development of, what is known as the 
Job Element Evaluation Plan. Under this 
plan 149 clerical operations were distinguished 
and their relative values determined. It was 
felt that with such a plan the writing of job 
descriptions would be more accurate, job 
comparisons more valid, and the comparison of 
wages and salaries among different companies 
facilitated. A detailed discussion of this plan 
is to be found in Clerical Salary Administration 
(6). 

Development of an Office Job Description 
Check-List. In 1947, at Purdue University, 
Culbertson set out to determine the adequacy 
of an operational check-list for the description 
of clerical jobs (3). From his personal experi- 
ence and from a survey of the literature he 
identified the basic operations which constitute 
clerical activity. After trial verifications of 
the items and experimental tryout of the 
check-list, it was concluded that it proved 
adequate for describing clerical jobs. 

Revision and Application of the Job Descrip- 
tion Check-List. As a further development of 




the check-list approach, Dudek (4) attempted 
to devise a job evaluation plan which would 
be relatively simple and easy to grasp, as well 
as easily administered. He hypothesized that 
these aims could be achieved by: (1) identifying 
all tasks or operations involved in a class of 
jobs; (2) evaluating these operations on a 
relative scale; and (3) evaluating each job in 
terms of the relative amount of time spent on 
each task. 
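
Dudek's three steps can be read as a time-weighted average of operation scale values. The passage does not give his actual formula, so the weighting below is an assumption, and the scale values and time shares are hypothetical.

```python
def job_value(operations):
    """Time-weighted mean of operation scale values.

    operations: list of (scale_value, share_of_time) pairs. The shares are
    normalized, so they need not sum to exactly 1.
    """
    total_time = sum(share for _, share in operations)
    return sum(value * share for value, share in operations) / total_time

# Hypothetical job with three operations: (scale value, share of working time).
typist = [(2.0, 0.6), (3.5, 0.3), (6.0, 0.1)]
print(round(job_value(typist), 2))
```

Under this reading a job spent mostly on low-valued operations stays near the bottom of the scale even when it includes an occasional high-valued task.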

Using Culbertson’s (3) original check-list 
as a basis, Dudek revised the items, and deter- 
mined a scale value for each. This check-list 
was tried out with a group of some 150 office 
employees in a radio plant. It was concluded 


that the check-list adequately described the 
tasks of the office workers in the study. 
Further research was suggested, however, for 
demonstrating the adequacy of such an 
instrument for job evaluation purposes. 


Procedure and Results 


The Job Description Check-List of Office 
Operations¹ used in the present study is a
slightly revised form of the one developed by 
Culbertson and Dudek. 

Procurement of Data. Each of the three 
companies from whom basic information was 
obtained is engaged in a different type of 
operation from that of the others, as well as 
being located in a different section of the 
country. A foundry, located in the South, 
supplied part of the data. Another contributor 
was a manufacturer of office filing supplies 
located in the East, and a third was a member 
of the automotive industry located in the 
Midwest. Data for cross-validation purposes 
were obtained from two steel mills, one in the 
Midwest and the other in the East. 

For purposes of convenience, these partici- 
pating concerns will be referred to throughout 
the study as: Company I, the foundry; 
Company II, the filing concern; Company III,


1 The detailed check-list has been filed with the 
American Documentation Institute. Order Document 
3267 from American Documentation Institute, 1719 N 
Street, N.W., Washington 6, D. C., remitting $1.00 for 
microfilm (images 1 inch high on standard 35 mm. 
motion picture film) or $1.20 for photocopies (6 X 8 
inches) readable without optical aid. Printed copies 
may be obtained by writing to Dr. C. H. Lawshe, 
Occupational Research Center, Purdue University, 
Lafayette, Indiana. 


the automotive plant; and Companies IV and 
V, the steel mills. 

Key Office Jobs from Cooperating Concerns. 
Each of the companies was asked to submit a 
list of 25 key office jobs according to suggested 
criteria. The key jobs were to be distributed 
throughout the entire range at present pay 
rates; were to “sample” the various areas of 
work being performed; should not be in 
dispute in regard to pay rates; and should be 
relatively well known by at least four or five 
people who were qualified to rate them. 

Companies II, III, IV, and V submitted 
lists of 25 key jobs, but Company I had only 
15 jobs which met the criteria. However, the 
final number of check-lists used were 14 for 
Company I, 25 for Company II, 20 for Com- 
panies III and IV, and 24 for Company V. 
Company II actually supplied 43 check-lists, 
which included one for each job incumbent in 
each key job; but, in instances where more 
than one check-list was prepared for a job, 
the scale values were averaged to arrive at the 
mean values used in the computations. The 
number of check-lists for some of the companies 
was reduced from the original number of key 
jobs because of the discontinuance of the job 
before the check-list phase of the study was 
completed. In other instances insufficient 
data were given to enable the inclusion of the 
check-list. 

Paired-Comparison Ratings of Key Jobs by 
Selected Judges. Upon receipt of the lists of 
key jobs, IBM cards for paired-comparison 
ratings were mailed to each company. These 
cards were marked independently by five 
raters in Company I and by four raters in each 
of the other companies. Typewritten instruc- 
tions for the raters accompanied each set of 
cards. 

Check-Lists Completed for Key Jobs. As 
soon as the paired-comparison rating cards 
were received from a company, the check-lists 
were mailed with accompanying instructions 
for completion. Each job incumbent and 
his immediate supervisor were requested to 
check independently the duties performed on 
the job. A third party, usually the coordinator 
of the research within the company, compared 
the two and identified any points of difference. 
A conference was then held with the incumbent 







and the supervisor in order to reach an agree- 
ment on these differences. 

After agreement had been reached, a con- 
ference was held with the supervisor alone to 
determine which of the operations were 
considered most important to the job. Com- 
panies I and III indicated the operations 
judged most important by marking them 
1, 2, 3, and on through 10, in the order of their 
importance. Companies II, IV, and V in- 
dicated the most important operations by 
marking the five operations judged most 
important “A,” the five second in importance
“B,” and the third five “C.”

Analysis of Data. The first step was to 
determine the reliability of the judges’ paired- 
comparison ratings of the key jobs, which 
ratings were to be used as criteria for determin- 
ing the validity of the check-list. 

Correlations of Judges’ Ratings—Used as 
Criteria. The average of the intercorrelations, 
obtained by means of Fisher’s Z transforma- 
tions (7), ranged from .79 for Company III 
to .93 for Company V. 
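
Averaging correlations by means of Fisher's z transformation amounts to converting each r to z, taking the mean of the z values, and converting that mean back to r. A minimal sketch follows; the judge-pair correlations are hypothetical, since the individual intercorrelations are not reported here.

```python
import math

def average_correlation(rs):
    """Average a set of correlations through Fisher's z transformation."""
    zs = [math.atanh(r) for r in rs]      # r to z
    return math.tanh(sum(zs) / len(zs))   # mean z back to r

# Hypothetical intercorrelations among four judges (six pairs).
pairs = [0.72, 0.81, 0.77, 0.85, 0.79, 0.80]
print(round(average_correlation(pairs), 2))
```

The z average gives somewhat more weight to the higher coefficients than a simple arithmetic mean of the r values would.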

By means of Shen’s formula (10) the reliabil- 
ity of the judges’ ratings was determined. 
The relatively high intercorrelations, as well 


as the resulting reliability figures, led to the 
conclusion that all judges should be included 
in the criterion measures with the exception 
of one judge for Company IV. 

For each job in a particular company, the 
average of the paired-comparison ratings of 
the judges was used as a criterion value. 


The reliabilities of these pooled ratings 
(criterion values) for the five companies, 
estimated by means of the Spearman-Brown 
formula (11) from the average intercorrelations 
referred to above, were as follows: I, .95; II,
.94; III, .94; IV, .95; and V, .98.
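
The Spearman-Brown estimate for the pooled ratings of k judges is k·r / (1 + (k − 1)·r), where r is the average intercorrelation. The sketch below checks it against the two companies whose average intercorrelation is stated above, taking four raters for each as described in the rating procedure. The function name is illustrative.

```python
def spearman_brown(mean_r, n_judges):
    """Reliability of the pooled ratings of n_judges judges."""
    return n_judges * mean_r / (1 + (n_judges - 1) * mean_r)

print(round(spearman_brown(0.79, 4), 2))  # Company III: 0.94
print(round(spearman_brown(0.93, 4), 2))  # Company V: 0.98
```

Both values agree with the reliabilities reported for Companies III and V.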

With criterion values as reliable as these, 
it was possible to make a comparison between 
the scale values of the check-list operations 
marked for a particular job and the value 


assigned to this same job by the judges.


Scale values had been previously assigned to 
the check-list operations by a group of 
experienced managerial judges during Dudek’s 
study. The problem in the present study was 
one of finding ways of combining check-list 
values to obtain optimum agreement with the 
criterion values. 


[Figure 1: coefficient of correlation plotted against number of operations used]

Fig. 1. Correlations of the means of the highest
scale values with the criteria.


As indicated earlier, Companies IV and V
were to constitute “hold-out” groups. Con-
sequently, attempts to derive weighting
schemes were confined to Companies I, II,
and III.

Correlations of Check-List Operations with the
Criteria. It has been suggested that a
relatively small number of the highest level 
operations performed by a job incumbent 
actually account for the over-all job level. If 
this is true, some scheme could be devised to 
isolate these critical operations and to give 
them a much greater weight than is given to 
the more or less routine types of activity that 
may be performed by nearly all office em- 
ployees. 

An attempt was made to identify these 
critical operations by selecting, from the ten 
operations judged most important by the 
supervisor of the job, the single operation 
having the highest scale value. Correlations 
computed for each of the three companies 
between these single scale values and the 
criterion ratings ranged from .51 to .74, with 
an average of .66. These values are shown 
graphically in Figure 1. Similarly, the mean 
scale value of the two operations with the 
highest scale values for each job was computed, 
correlated with the ratings, and the results 
plotted in Figure 1. This process was con- 
tinued, using the highest three, the highest four, 
etc., until ten operations were used. To 
complete the picture, the mean scale value of 
all operations marked for each job was com- 
puted, the correlations determined, and plotted 
in the figure. 


[Figure 2: coefficient of correlation plotted against number of operations used]

Fig. 2. Correlations of the means of the operations
judged most important with the criteria.


It will be noted in Figure 1 that there is a 
systematic increase up through five operations 
and that the curve then descends. In other 
words, in so far as one can generalize from these 
data, it appears that the means of the five 
operations with the highest scale values 


correlate higher with the criterion judgments 
than do the means of fewer operations, or more 
operations. 

Another attack involved disregarding the
magnitude of the scale values of the operations 


as such and considering the items in terms of 
their relative importance as estimated by the 
supervisors. For Companies I and III the 
procedure was identical to that described above 
except that the single operation used was the 
one judged most important; when two opera- 
tions were used they were the two judged most 
important, etc. This procedure was possible 
with these two companies inasmuch as the 
supervisors ranked the operations in the order 
of importance. However, since the supervisors 
in Company II were asked only to pick the 
five most important and the five next most 
important operations, it was possible to 
locate only two points on the curve in addition 
to the point representing all operations. As 
shown in Figure 2, the mean trend of these 
correlations is similar to the trend in Figure 1, 
in that there seems to be an optimum point 
around four or five operations where the 
correlation appears higher than it does when 
fewer or more operations are considered. 

The maximum average correlation in Figure 
1 is .79, whereas the maximum average 
correlation in Figure 2 is .84. It would be 




difficult, if not impossible, to demonstrate a 
statistically significant difference between these 
two values, but in view of the fact that higher 
correlations were obtained by using those 
operations judged most important in contrast 
to those operations with the highest values, 
it seems justifiable to utilize the former as a 
basis for a weighting system. 

Having obtained correlations averaging .84 
between the mean scale value of the five 
operations judged most important and the 
criterion values of the jobs, the next question 
concerned the remaining operations. Would 
the utilization of the scale values of the 
remaining operations performed materially 
change the correlations? 

In an attempt to answer this question, 
the correlations of the five operations judged 
most important were considered as the A 
group; the next five in importance as group B; 
and all remaining operations were included in 
group C. These r’s obtained from correlating 
the means of groups A, B, and C with the
criteria were used for computing shrunken
multiple R for each company, following the
Wherry-Doolittle selection procedure (12). 
The shrunken multiple R’s with the inclusion 
of groups A, B, and C range from .78 for Com- 
pany I to .86 for Company III. In each of 
the three companies a smaller shrunken 
multiple R resulted from the inclusion of all 
three groups. Essentially no increase was re- 
alized from the inclusion of any group other 
than A. The use of groups A and B in Com- 
pany III changed the R only from .863 to .866, 
whereas in Company II the use of groups A and 
C changed the R from .795 to .788. Other com- 
binations of operations failed to yield higher 
correlations. 
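
The comparison described above can be sketched numerically. The following is a minimal illustration, not the author's computation: the zero-order correlations are hypothetical, and the shrinkage formula used, 1 - (1 - R^2)(N - 1)/(N - m) for m predictors, is an assumption about the form of the correction applied in the Wherry-Doolittle procedure, since the paper does not print it.

```python
import math


def multiple_r_two_predictors(r1y, r2y, r12):
    """Multiple R of a criterion on two predictors, from zero-order r's."""
    r_squared = (r1y**2 + r2y**2 - 2 * r1y * r2y * r12) / (1 - r12**2)
    return math.sqrt(r_squared)


def shrunken_r(r, n_cases, n_predictors):
    """Shrinkage correction (assumed form): 1 - (1 - R^2)(N - 1)/(N - m)."""
    r_squared = 1 - (1 - r**2) * (n_cases - 1) / (n_cases - n_predictors)
    return math.sqrt(max(r_squared, 0.0))


# Hypothetical values: group A vs. criterion, group B vs. criterion, A vs. B.
r_a, r_b, r_ab = 0.85, 0.60, 0.65
n_jobs = 20  # hypothetical number of jobs

r_two = multiple_r_two_predictors(r_a, r_b, r_ab)
print(round(r_two, 4))                         # unshrunken R for A and B
print(round(shrunken_r(r_two, n_jobs, 2), 4))  # shrunken R for A and B
print(round(shrunken_r(r_a, n_jobs, 1), 4))    # group A alone
```

With these hypothetical inputs the shrunken R for two groups falls slightly below the correlation for group A alone, which is the pattern the text reports when further groups of operations are added.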

Cross-Validation on “Hold-Out” Groups.
From the r’s obtained by correlating the means 
of the five most important operations with 
the criteria, and the means of all other opera- 
tions correlated with the criteria, a multiple- 
regression equation was computed for each 
company (8). These regression equations 
further indicated that little, if anything, is 
added by the inclusion of operations beyond 
the five considered most important by the 
supervisors. However, although certain slight 
statistical advantages might accrue from 
using only the five most important operations, 







Table 1

Multiple Correlations and Correlations Derived by
Generalized Regression Equation *

Company     Jobs

I            14
II           25
III          20
IV †         20
V †          24

* Computed from the mean values of the five operations
judged most important and the mean values of all
other operations.

† “Hold-out” groups.


it is felt that these advantages are counter- 
balanced by the improved employee relations 
which might result from the inclusion of all 
operations checked as a part of a particular job. 
Consequently, a generalized regression equation,
$X_1 = 4X_2 + X_3$, was set up. This equation
gives four times as much weight to the mean 
values of the five operations judged most 
important as is given to the mean values of 
all other operations. While this gives greatest 
emphasis to the five most important opera- 
tions, it is felt that it likewise gives adequate 
recognition to all other operations included in 
the job. 
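
As an illustration of how the generalized equation operates, the sketch below forms the weighted composite for a handful of jobs and correlates it with a criterion. All job values are hypothetical, and the function names are introduced here only for the example.

```python
import math


def predicted_value(mean_top_five, mean_others):
    """Generalized equation: X1 = 4*X2 + X3."""
    return 4 * mean_top_five + mean_others


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for five jobs.
mean_top_five = [2.0, 3.0, 4.0, 5.0, 6.0]   # X2: mean of five most important operations
mean_others = [1.5, 2.0, 3.5, 3.0, 4.5]     # X3: mean of all other operations
criterion = [10.0, 14.0, 21.0, 22.0, 30.0]  # judges' ratings (hypothetical)

composite = [predicted_value(a, b) for a, b in zip(mean_top_five, mean_others)]
r = pearson_r(composite, criterion)
print(composite, round(r, 3))
```

Because the correlation is unaffected by the overall scale of the composite, only the 4-to-1 ratio of the weights matters in this use of the equation.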

This regression equation was applied to the 
data from the two steel mills which constituted 
the “hold-out” groups. The resulting r for 
Company IV was .890 as compared with the 
shrunken multiple R of .884 obtained by 
correlating the mean of the five most important 
operations with the criteria, plus the mean of 
all other operations. For Company V the r 
which resulted was .867 as compared to the 
shrunken multiple R of .884. It will be 
noted from Table 1 that the results were 
equally comparable when the generalized 
regression equation was applied to the three 
companies which furnished the basic data. 
These results appear to be sufficient proof of 
the adequacy of the generalized equation, as 
well as of the validity of the check-lists. 


Summary and Conclusions 


The purpose of this study was to determine 
the validity of a Job Description Check-List 
for evaluating office jobs. A revision of the
check-list developed at Purdue University
was used in the study.
Data were obtained from five companies.
Key office jobs were rated on a paired-comparison
basis by selected judges; check-lists were
completed for the key jobs; and the ten
operations judged most important to the
job were indicated. Judges’ ratings were
used as criteria. After computing zero-order
r’s and shrunken multiple R’s, a generalized
multiple-regression equation was set up for
cross-validation purposes on data from two
“hold-out” groups.
The following conclusions may be drawn:


1. The judges’ ratings, which were used as 
criteria, had high reliabilities. 

2. The five operations judged most important
to a job appear to be the optimum number
for evaluation purposes. Neither the zero- 
order r based on more operations nor the 
shrunken multiple R computed from various 
groupings of operations yields a significantly 
higher relationship. 

3. For the promotion of good employee 
relations, it is considered advisable that all 
operations be included, with the five most 
important operations being given most weight. 

4. The generalized regression equation,
$X_1 = 4X_2 + X_3$, when applied to the two “hold-out”
groups predicted the criterion values
almost as well as did the regression equations 
derived directly from the data. 

5. Within the limits of this study, the Job 
Description Check-List of Office Operations 
appears to be a valid instrument for evaluation 
purposes. 


Received May 7, 1951. 


References 


1. Bellows, R. M., and Estep, M. Frances. Job 
evaluation simplified; the utility of the occupa- 
tional characteristics check-list. J. appl. Psy- 
chol., 1948, 32, 354-359. 

2. Burk, S. L. H. A case history in salary and wage 
administration. Personnel, 1939, 15, 93-129. 

3. Culbertson, A. L. The adequacy of an operational 
check-list for the general description of clerical jobs. 
Unpublished M.S. thesis, Purdue University, 
1947. 

4. Dudek, E. E. An operational approach to the evalu- 
ation of office jobs. Unpublished Ph.D. thesis, 
Purdue University, 1948. 












5. Ells, R. W. Salary and wage administration. New
York: McGraw-Hill, 1945. 

6. Ferguson, L. W. Clerical salary administration. 
New York: Life Office Management Association, 
1948. 

7. Fisher, R. A., and Yates, F. Statistical tables. 
(3rd ed.) New York: Hafner Publishing, 1949. 

8. Guilford, J. P. Psychometric methods. New York:
McGraw-Hill, 1936. 

9. Kelly, Chalice. Job analysis a basis for payment
according to output. A. M. A. Office Mgmt. Ser.,
1930, 53, 2-16.




10. Kelley, T. L. Fundamentals of statistics. Cam- 
bridge, Mass.: Harvard University Press, 1947. 
10a. Miles, Minnie Caddell. The validity of a job de- 
scription check list for evaluating office jobs. Un- 
published Ph.D. thesis, Purdue University, 1951. 

11. Peters, C. C., and Van Voorhis, W. R. Statistical
procedures and their mathematical bases. New 
York: McGraw-Hill, 1940. 

12. Stead, W. H., Shartle, C. L., and Associates. Oc- 
cupational counseling techniques. New York: 
American Book, 1940. 

13. Trends in office personnel problems. Anon. Mgmt.
Rev., 1949, 38, 301-302.





Specificity of Over- and Under-Achievement in College Courses 


William C. Krathwohl 


Institute for Psychological Services, Illinois Institute of Technology 


Some work has been done to measure the 
intangible traits of industriousness and indolence¹
in contrasting fields. The fields which
have been investigated were those of math- 
ematics and of English by Krathwohl (3, 4, 
5, 6). A question which naturally arises is 
whether these traits carry over from one 
discipline to another, whether industriousness 
and indolence are general or specific and 
perhaps related to specific interest. 

One way to answer the question of the 
independence of work habits is to find a device 
for measuring industriousness in a field, and 
then to correlate such measures with similar 
measures in another field. 

Such a device, which will measure indus- 
triousness or indolence in a field, consists of a 
comparison of the scores received on aptitude 
and achievement tests. If the score of an 
individual on an achievement test in some 
subject is appreciably higher than his score 
on the corresponding aptitude test, such a 
person is defined as being industrious in that 
subject. If his scores are about the same, 
he is defined as being normal; but if his score on 
the achievement test is appreciably lower than 
his score on the aptitude test, he is defined as 
being indolent in that field. 

To make such scores comparable, the 
transformation of raw scores to standard 
scores (x/σ) is the conventional procedure.
In this investigation derived scores were used 
which have a mean of 20, a standard deviation 
of 4, and are rounded off to the nearest integer. 

The experiment to determine the independ- 
ence of the indexes of industriousness was set 
up with 308 second term sophomores who had 
taken the sophomore achievement tests in 
May 1948 at the Illinois Institute of Tech- 
nology. The tests were on English expression, 
chemistry, mathematics, and physics. These 

¹ For conciseness and to avoid awkward construction,
the word indolence as employed in this investigation is
used not in a derogatory sense, but rather as a substitute
for under-achiever. In the same way, the word
industrious is used as a substitute for over-achiever.


tests were constructed by the Measurement 
and Guidance Project of the Educational 
Testing Service. The aptitude tests for chem- 
istry, mathematics and physics were form M 
of the Chemistry Aptitude, Mathematics 
Aptitude and Physics Aptitude Tests, respec- 
tively, which are published by the Bureau of 
Educational Research and Service of the 
University of Iowa. The aptitude test selected 
for English was the short fifteen minute 
vocabulary section of the Cooperative Reading 
Comprehension Test, Advanced Form, which 
is published by the Cooperative Test Service, 
now the Educational Testing Service. The 
reason for the selection of a vocabulary test 
as an English aptitude test is given in an 
article by Krathwohl (5). 

Correlations between aptitude and achieve- 
ment varied from 0.42 for mathematics to 
0.58 for English and all coefficients were 
statistically significant. 

The index of industriousness, briefly I.I.,
say for chemistry, for any student was defined 
to be his derived score on the chemistry 
achievement test minus his derived score on 
the chemistry aptitude test. Indexes of 
industriousness for the remaining three subjects 
were computed in a similar manner. Normal 
students were considered to be those students 
whose I.I.’s ranged from −2 to +2. These
normal students constituted approximately 
the middle 50 per cent of the group and so could 
be classed as average in work habits in the 
sense that the word average usually is employed 
in psychology. Industrious students were 
defined to be those whose I.I.’s were equal to 
or greater than 3 and constituted approx- 
imately 25 per cent, or practically the upper 
quartile, of the 308 sophomores in the experi- 
ment. Indolent students were defined to be 
those whose I.I.’s were equal to or less than 
—3 and constituted approximately 25 per 
cent, or practically the lowest quartile, of the 
entire group. 
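
The definitions above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the raw scores are hypothetical, the population standard deviation is assumed for the conversion (the paper does not say which was used), and the function names are introduced only for this example.

```python
import statistics


def derived_scores(raw_scores, mean=20, sd=4):
    """Convert raw scores to derived scores with the given mean and S.D.,
    rounded to the nearest integer (Python rounds exact halves to even)."""
    raw_mean = statistics.mean(raw_scores)
    raw_sd = statistics.pstdev(raw_scores)  # population S.D.: an assumption
    return [round(mean + sd * (x - raw_mean) / raw_sd) for x in raw_scores]


def index_of_industriousness(achievement_derived, aptitude_derived):
    """I.I. = derived achievement score minus derived aptitude score."""
    return achievement_derived - aptitude_derived


def classify(ii):
    """Cut-offs from the text: >= 3 industrious, <= -3 indolent, else normal."""
    if ii >= 3:
        return "industrious"
    if ii <= -3:
        return "indolent"
    return "normal"


# Hypothetical raw scores for three students on one test.
print(derived_scores([10, 20, 30]))
# A student with derived scores of 24 (achievement) and 20 (aptitude).
print(classify(index_of_industriousness(24, 20)))
```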

The four aptitude tests were taken in 




Table 1

Correlations Between Indexes of Industriousness

Indexes                            t

English with chemistry            1.3
English with physics              1.4
English with mathematics          2.3
Chemistry with physics            1.2
Mathematics with physics          1.6
Mathematics with chemistry        4.7
Mathematics with physics**        2.5

* Significant at one per cent level.
** Freshmen.


September 1946 and the four achievement 
tests were taken in May 1948 so that one year 
and eight months elapsed between the taking 
of the aptitude tests and the taking of the 
achievement tests. During this period of 
time, these students had been exposed to the 
vicissitudes of college life, subjected to the 
temptations of extra-curricular activities and, 
in general, had had an opportunity to settle 
down to somewhat more steady study habits 
than they had in the beginning of their 
freshman year. One advantage of using 
tests almost two years apart is the elimination 
of the necessity for proving the persistence 
of the indexes of industriousness over a two 
year period. A second advantage is that the 
computation of I.I.’s over a two year period 
reflects the changes that may have occurred in 
a student’s work habits. 

That indexes of industriousness really meas- 
ured the effect of the industrious and indolent 
work habits for at least two of such diverse 
subjects as mathematics and English has 
been shown by Krathwohl (3, 4, 5, 6). 

The correlation coefficients between these 
various indexes of industriousness are shown 
in Table 1, where the first column gives the 
value of the correlation coefficients, and the 
second column gives the t-ratios. The popula- 
tion of each group is identical with that of the 
same named group in Table 2. 
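
A t-ratio for a correlation coefficient is conventionally obtained from r and the number of cases N. The sketch below assumes the usual formula, t = r·sqrt(N − 2)/sqrt(1 − r²); the paper does not state which formula was used. Applied to the two mathematics-with-physics coefficients quoted in the text (0.26 with 37 sophomores, 0.18 with 184 freshmen), it gives values that agree with the t-ratios of 1.6 and 2.5 listed in Table 1.

```python
import math


def t_ratio(r, n):
    """t for testing a correlation against zero (assumed standard formula)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r**2)


print(round(t_ratio(0.26, 37), 1))   # sophomores, mathematics with physics
print(round(t_ratio(0.18, 184), 1))  # freshmen, mathematics with physics
```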

It is evident from these correlation coefficients
that those for the English I.I. with
each of the remaining three indexes are so low 
that the conclusion cannot be drawn that 
industriousness or indolence in English implies 
the same type of work habits in the three 




remaining subjects. That is to say, a student 
who is industrious in English may or may not 
be industrious in mathematics, chemistry, or 
physics. 

In the case of the indexes of industriousness 
for mathematics and chemistry, the correlation 
coefficient between them is fairly small, 0.34, 
but is statistically significant at better than 
the 1 per cent level. 

In the case of the indexes of industriousness 
for mathematics and physics for sophomores, 
the correlation coefficient is small, and is not 
significant. However, the frequency, 37, of 
this group is small enough to cast doubts on 
the result. Hence, the entire procedure was 
repeated with 184 freshmen who had taken 
some locally prepared scholarship examinations 
in mathematics achievement and _ physics 
achievement. 

These 184 students later entered as freshmen 
and took the Iowa Mathematics Aptitude 
Test and the Iowa Physics Aptitude Test. 
The correlation coefficient, using these 184 
freshmen, between the I.I. for mathematics 
and the I.I. for physics turned out to be 0.18,
which was not very different from the previous 
value of 0.26. However, the increased number 
of freshmen made this coefficient significant 
at almost the 1 per cent level. 

For further illumination on the independence 
of the various indexes of industriousness, the 
chi square method was resorted to, and the 
results are shown in Table 2. 

In Table 2 all the values of P, with the 
exception of the indexes of industriousness 


Table 2

Values of Chi Square Between Various Indexes
of Industriousness

Indexes                           N     Chi Square

English with chemistry           206       8.45
English with physics             106       2.98
English with mathematics         232       4.29
Chemistry with physics            63       2.40
Mathematics with physics          37       0.011
Mathematics with chemistry       191      15.75
Mathematics with physics*        184       5.65

* Freshmen.







for chemistry with mathematics, are so much 
larger than 0.01 that the conclusion can be 
drawn that they are independent of each other 
as shown by Lindquist (7). The value of P 
for mathematics I.I. compared with chemistry 
I.I., given as 0.01, is really less than that, and 
means that there is a relation between the 
I.I. for mathematics and the I.I. for chemistry. 
The existence of such a relation is borne out 
by the significant correlation coefficient of 
0.34 between indexes of industriousness for 
mathematics and for chemistry mentioned 
previously. However, this correlation is low 
enough for one to conclude that whatever 
relation exists must be a small one. 

The case for the indexes of industriousness 
for mathematics and physics is settled in 
Table 2. Here the high value of P, which 
equals 0.92 for the 37 sophomores, raises some 
question about the size of the sample of the 
population and suggests that it may be too 
small. When the larger sample of 184 fresh- 
men was used, the value of P dropped to 0.23, 
which is well within the range where independ- 
ence of the indexes of industriousness for 
mathematics and physics is assured. 
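
The P values quoted above can be checked from the chi-square values with a short computation. The sketch below is an illustration only: the degrees of freedom (1 for the 37 sophomores, 4 for the larger groups, as for a 3 × 3 table of industrious, normal, and indolent) are assumptions, chosen because they reproduce the P values of 0.92 and 0.23 given in the text. It uses closed forms for the upper-tail probability that hold for 1 degree of freedom and for even degrees of freedom.

```python
import math


def chi_square_p(chi_sq, df):
    """Upper-tail probability of chi square for df = 1 or even df."""
    half = chi_sq / 2
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df % 2 == 0 and df > 0:
        # exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        terms = sum(half**k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("only df = 1 or even df are handled in this sketch")


print(round(chi_square_p(0.011, 1), 2))  # 37 sophomores, assumed df = 1
print(round(chi_square_p(5.65, 4), 2))   # 184 freshmen, assumed df = 4
print(chi_square_p(15.75, 4) < 0.01)     # mathematics with chemistry
```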

On the whole, it can be said that the indexes 


of industriousness for English, chemistry, 
mathematics, and physics are independent of 
each other with the exception of a slight unex- 
pected relationship between work habits in 


mathematics and chemistry. It should be 
noted that the mathematics I.I. and chemistry 
I.I. pair is the only one out of six possible 
pairs among English, chemistry, mathematics 
and physics which differs markedly from the 
other five. The correlation coefficient between 
the two indexes for mathematics and chemistry, 
although low, is the highest of the six and is 
the only coefficient which is statistically signif- 
icant. Furthermore, the mathematics-chem- 
istry pair is the only one between which the 
chi square test indicates a definite relationship. 
A situation of this type needs further investiga- 
tion. It is possible that such a relationship 
exists only among engineering students, be- 
cause engineering students are known to differ 
in some of their characteristics from liberal 
arts students and from some other professional 
groups, such as pre-law and pre-medicine, as 
was found by Fairbairn (1). Therefore, a 
study similar to this should be conducted on 




liberal arts students and repeated on another 
group of engineering students. It is also 
possible that the relationship between the 
chemistry I.I. and the mathematics I.I. is
due to the nature of the tests used to measure 
achievement in mathematics and in chemistry. 

An explanation of the ease of proving the 
independence between the I.I. for English 
and the I.I.’s for the sciences as compared with 
proving the independence of the I.I.’s among 
the sciences is seen by comparing the six 
inter-correlations among the four subjects. 
The correlations of English achievement with 
achievement in the three sciences, chemistry, 
mathematics, and physics, vary within the 
narrow range of 0.29 and 0.32, whereas the 
inter-correlations between achievements in the 
three sciences vary within the narrow range of 
0.54 and 0.56. All correlation coefficients are 
statistically significant. Whatever  correla- 
tions exist at all between English and the 
sciences are probably due to a common factor 
associated with intelligence. The larger cor- 
relations between the sciences undoubtedly 
are due also to communality of subject matter, 
and probably it is this communality of subject 
matter which explains the difficulty in proving 
the independence of indexes of industriousness 
among the sciences. 

From this investigation it can be concluded, 
certainly for students in an engineering school 
and undoubtedly for others, that indexes of 
industriousness are specific instead of general 
because there exists at least a set of four
subjects, English, chemistry, mathematics,
and physics, which either are independent of
each other or, if dependent, have only a small 
relationship. Hence, there is sufficient evi- 
dence to say that a student should not be 
considered industrious, as such, but rather 
that he is industrious in mathematics or 
English or whatever the field may be. Neither 
should he be considered indolent, as such, but 
rather that he is indolent in mathematics or 
English or whatever the field may be. In 
general, then, it is possible that a student might 
be industrious in mathematics, normal in 
English, and at the same time indolent in 
physics. It is also easily conceivable that the 
idea of specific industriousness may extend 
beyond the academic field into the fields, say, 
of commerce and industry. 






Another conclusion that can be drawn is 
that if indexes of industriousness are compared 
in two fields which have a communality of 
subject matter, there is a possibility of a slight 
carry over of work habits from one field to 
the other, but such a possibility is small and 
sometimes it does not occur at all. 

The specificity of work habits of industriousness
follows very strikingly along lines similar
to those found by Hartshorne and May (2)
in their studies of social habits, such as honesty, 
truthfulness, and morality. They found that 
these social habits were specific instead of 
general. That is to say, we cannot speak of 
an honest man, but rather of an honest act. 
For instance, a man may be honest in his 
income tax returns, but dishonest when he 
fails to report to the Lost and Found Bureau an 
article which he has found. In like manner, 
we cannot speak of an industrious individual, 
but rather we must say that an individual is 
industrious in some one area of activity, 
whereas he may be indolent or normal in 
another. 

Although this study has been done only on 
engineering students, it seems reasonable to 
assume that these conclusions should also 


hold for liberal arts students, although that 


fact needs verification. 

Because of the specific nature of work 
habits, it is possible that some of the difficulties
which investigators have had with
under-achievement are due to their attempt to
cover too many diverse fields of study at the 
same time. 


Summary 


1. Certainly, as far as engineering students
are concerned and undoubtedly for others,
industriousness in any one of the four subjects
(English, chemistry, mathematics, and physics)
does not necessarily imply industriousness in
any of the remaining three. A possible
exception is one involving mathematics and 
chemistry, in which there is only a slight 
possibility that work habits in mathematics 
may be associated with the same kind of work 
habits in chemistry. 

2. An individual should not be considered 
industrious as such, but rather industrious in 
mathematics or English or whatever the 
subject may be. Such information is partic- 
ularly valuable in a counseling situation where 
it should be remembered that it is possible 
for an individual to be industrious in math- 
ematics, normal in English, and indolent in 
physics at one and the same time. 

3. As far as specificity is concerned, work
habits of industriousness are very similar to 
the social habits of honesty, truthfulness, and 
morality which were investigated by Hart- 
shorne and May. 


Received May 14, 1951. 


References 


1. Fairbairn, Helen. Vocational interests. In E. S. 
Jones (Ed.), University of Buffalo Studies, 1930, 
8, 61-65. 

2. Hartshorne, H., May, M. A., and Shuttleworth,
F. K. Studies in the organization of character.
New York: Macmillan Company, 1930.

3. Krathwohl, W. C. The persistence in college of
industrious and indolent work habits. J. educ.
Res., 1949, 42, 365-370.

4. Krathwohl, W. C. Effects of industrious and indolent
work habits on grade prediction in college
mathematics. J. educ. Res., 1949, 43, 32-40.

5. Krathwohl, W. C. An index of industriousness for
English. J. educ. Psychol., 1949, 40, 469-481.

6. Krathwohl, W. C. Relative contributions of vocabulary
and an index of industriousness for English
to achievement in English. J. educ. Psychol.,
1951, 42, 97-104.

7. Lindquist, E. F. Statistical analysis in educational
research. Boston: Houghton Mifflin, 1940. Pp.
41-43.





The Role of Tests in the Medical Selection Program 


Ray B. Ralph and Calvin W. Taylor 


Department of Psychology, University of Utah 


The problem of the proper selection of 
medical students from the ranks of applicants 
has long existed. In recent years this problem 
has been intensified by the increased number 
of applicants desiring admission. While this 
has made the selection ratio more favorable, 
it has at the same time required that medical 
selection committees spend considerable time 
and effort on the complex task of trying to 
identify the best prospects. Recent world 
events together with the current deferment 
program have intensified the importance of 
this problem. It is hoped that the present 
article will focus more attention upon and 
provide further insight into the medical 
selection problem. 

In 1946 the Moss Scholastic Aptitude Test 
was discarded by the Association of American 
Medical Colleges in favor of the Professional 
Aptitude Test, which was renamed the Medical 
College Admission Test (MCAT) in October, 
1948. Hereafter, in order to avoid confusion, 
this officially used test will be called by its 
current name, Medical College Admission Test, 
inasmuch as the only important change, other 
than renaming, made in the test throughout 
the periods herein reported was the addition 
of the subtest, Modern Society. 

The Medical College Admission Test 

Admittedly, evaluation studies on the official 
Medical College Admission Test have not been 
designed as well as would be desirable, primarily 
because all of the medical student samples 
studied have been selected at least partially on 
the basis of the test being evaluated. None- 
theless, one can study the role played by the 
MCAT scores in the complex medical selection 
program by determining means and standard 
deviations on recently selected medical classes. 
The mean and standard deviation for the 
MCAT subtests on each year’s medical 
applicant population are 500 and 100, respec- 
tively. If considerable attention is given to 
a particular MCAT score in the selection 


process, then the medical class would be 
highly selected on that particular character- 
istic, as indicated by a high mean and a low 
standard deviation. On the other hand if a 
score is largely disregarded, then the sample 
selected will have a mean near 500, a large 
standard deviation, and will not differ greatly 
from the total medical applicant population 
on this characteristic. It was also decided to 
use medical academic success, the grade point 
average over the first portion of medical 
training, as a criterion and to validate the 
test scores against this criterion. 
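
The reasoning above, that heavy reliance on a score in selection shows up as a raised mean and a reduced standard deviation, can be expressed as two simple ratios against the applicant norms of 500 and 100. The sketch below is only an illustration; the class values passed in are hypothetical, and the function name is introduced for this example.

```python
APPLICANT_MEAN = 500.0  # norm for each MCAT subtest, from the text
APPLICANT_SD = 100.0


def selection_summary(class_mean, class_sd):
    """Express a class's mean and S.D. relative to the applicant norms."""
    return {
        "mean_shift_in_sd_units": (class_mean - APPLICANT_MEAN) / APPLICANT_SD,
        "sd_ratio": class_sd / APPLICANT_SD,
    }


# Hypothetical class: mean 580, standard deviation 75 on one subtest.
summary = selection_summary(580.0, 75.0)
print(summary)
```

A mean shift near zero together with a ratio near 1.0 would suggest that the score was largely disregarded in selection; a large shift with a ratio well below 1.0 would suggest that it was weighted heavily.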

The first reported study on the test was 
performed by Young and Pierson.¹ Scores on
the MCAT were correlated with first quarter 
medical college grade point averages for a 
sample of fifty freshman medical students. 
Results of this study are listed in the first 
data column of Tables 1, 2, and 3 under the 
heading “1947 Utah Class, First Quarter
Criterion.” They further reported the grade 
point average for premedical science courses 
(one important basis for student selection in 
the program) to be most highly correlated 
(.50) with first quarter medical college grades. 

The present writers followed up this same 
sample of medical students further and deter- 
mined their scholastic success at the end of 5 
quarters and at the end of the four scholastic 
years (12-quarter accelerated program). The 
additional statistical findings on the 1947 
Utah class are also listed in Tables 1, 2, and 3. 
The 1948 and 1949 Utah classes were studied 
in a similar manner by the present writers with 
the measure of academic success in each case 
being the first medical year grade point 
average. These results are also listed in the 
three tables. Test scores of students in the 


1 Young, R. H., and Pierson, G. A. The Professional 
Aptitude Test, 1947, a preliminary evaluation. J. A. 
A. M. Colls., 1948, 23, 176-179. These investigators 
also studied the Moss Scholastic Test, the Strong Voca- 
tional Interest Blank for Men, and the Minnesota 
Multiphasic Personality Inventory on the same medical 
class. 




Table 1

Means for the Medical College Admission Test

                                    1947 Utah Class           1948     1949    Michigan   Iowa
                                                              Class    Class    Class     Class
Test Score                       1st Qtr.  5 Qtrs.  4 Yrs.    1 Yr.    1 Yr.    1 Yr.     1 Yr.

Scientific Vocabulary             576.2    579.8    579.8     571.2    548.6    590.1     525
Social Vocabulary                 534.2    530.2    530.2     508.8    520.6    550.4     542
Humanistic Vocabulary             542.0    536.8    536.8     507.1    521.0    547.1     526
Composite Verbal Ability          549.0    551.6    551.6     532.2    525.4    567.8     535
Quantitative Ability              528.8    527.0    527.0     518.2    573.1    575.4     568
Index of General Ability          552.8    548.9    548.9     529.8    546.4    575.2     547
Modern Society                                                520.7*   517.7
Premedical Science Achievement    588.0    591.4    591.4     588.0    583.3    606.5     540

Number of Students Sampled         50       44       44        51       52      102        81

* N = 42 students.


Table 2

Standard Deviations for the Medical College Admission Test

                                    1947 Utah Class           1948     1949
                                                              Class    Class
Test Score                       1st Qtr.  5 Qtrs.  4 Yrs.    1 Yr.    1 Yr.

Scientific Vocabulary                       76.8     76.8      87.1     73.2
Social Vocabulary                           72.8     72.8      76.9     72.8
Humanistic Vocabulary                       85.1     85.1      74.8     86.4
Composite Verbal Ability                    74.2     74.2      69.9     67.1
Quantitative Ability                        70.7     70.7      82.3     88.6
Index of General Ability                    70.4     70.4      70.4     64.5
Modern Society                                                 76.8*    71.2
Premedical Science Achievement              76.4     76.4      65.8     66.7

Number of Students Sampled                                      51       52

* N = 42.

Michigan Class, 1 Yr. (row alignment not recoverable): 90.7, 83.0, 81.3, 80.4, 80.3, 74.4
Iowa Class: no values legible.

Table 3 
Validity Coefficients for the Medical College Admission Test 








1948 
Utah Class 


1949 
Utah Class 


Iowa
Class 


Michigan 
1947 Utah Class Class 





Test Score 
Scientific Vocabulary 
Social Vocabulary 
Humanistic Vocabulary 
Composite Verbal Ability 
Quantitative Ability 
Index of General Ability 
Modern Society 


Premedical Science Achievement 


Number of Students Sampled 


1st Qtr. 5 Qtrs. 4 Yrs.


.23 Bs | 
— .08 07 
—.22 — .06 
—.10 06 

.19 16 

08 


26 
44 


“16 
10 


— .02 


07 
.23 





*N = 42 students. 







same class who had taken different forms of 
the Medical College Admission Test were 
lumped together, the different forms being 
considered identical in the statistical treatment. 
This assumption, however, did not particularly 
alter the results found on a subsample of 
those who took only the same form. Since 
only 42 of the 51 students in the 1948 Utah 
class took the newer form with the Modern 
Society subtest, results for this subtest were 
based on the reduced sample. 

The results of two unpublished studies, one 
on a Michigan medical class by R. M. W. 
Travers and the other on an Iowa medical 
class by the University Examination Service, 
are also listed in Tables 1, 2, and 3. Similar 
to the findings of Young and Pierson, the Iowa 
study indicated that the grade point average 
in premedical science classes with a validity 
coefficient of .55 was a better predictor of 
first year medical success than was any part of 
the MCAT. It was also found that, in general,
scores on this 1947 form correlated higher with 
grades already attained in premedical sciences 
than with the subsequently achieved success in 
the first year of medical college. 

An inspection of the means and standard 
deviations of the MCAT subtests on the 
reported studies affords some insight into the 
selection procedure utilized on each class. 
From the results in Tables 1 and 2 it seems 
evident that more attention was given to 
scores in some of the subtests than others in 
the selection of the medical classes. 

An inspection of Table 3 shows the validity 
coefficients for the various subtests across 
the samples were generally low and in several 
cases were essentially zero. Individual sub- 
tests in certain of the studies occasionally 
showed meaningful validities but in many 
cases this result was counterbalanced by 
essentially zero validities in the other studies. 
In terms of medical academic success Scientific 
Vocabulary, Quantitative Ability, and Pre- 
medical Science Achievement appear to be 
the only consistently valid subtests of the 
MCAT, even after considering restriction of 
range. 

In summary, on five samples of medical 
students from three universities, it is evident 
that the medical students are more highly 
selected on certain MCAT subtest character- 




istics than on others. Many of the subtests 
in the MCAT have shown little evidence of 
being valid as predictors of medical academic 
success. It may be possible, however, that 
some of these subtests (e.g., Modern Society) 
are valid for some desirable purpose or purposes 
other than the prediction of medical academic 
success. If any of the subtests were developed 
for other purposes, it would be advisable to 
define these purposes clearly so that studies 
could be designed to see how well the subtests 
achieve these other goals. 


Evaluation of Some Other Aptitude Scores 


It was decided to attempt the validation of 
some other promising aptitude scores to 
determine if they were related to medical 
academic success. The General Aptitude Test 
Battery (GATB) was made available for this 
experimental study. This battery, developed 
for use in public employment office counseling 
programs, consists of 16 tests which yield 11 
aptitude scores (Letter Series, Test “E,” was 
treated as measuring a separate aptitude). 
Most of these 11 aptitude factors are well 
known. Measures of identical or similarly 
named aptitudes are found in several other 
test batteries and particularly in factorial 
research studies. 

The GATB aptitude scores were established 
so that the mean score for the worker population 
is 100 and the standard deviation is 20.² 

The 1947 Utah class was tested with the 
GATB after they had completed five quarters 
of medical training. With the exception of 
five persons, all of the forty-nine medical 
sophomores tested had been selected for 
medical training partly on the basis of their 
scores on the MCAT. The means, standard 
deviations, and validity coefficients for the 
GATB aptitudes against the five-quarter grade 


point average are listed in Table 4. These 


² For further information about the General Aptitude 
Test Battery, see the following references: (a) Dvorak, 
Beatrice J. The new USES General Aptitude Test 
Battery. J. appl. Psychol., 1947, 31, 372-376; (b) Staff, 
Div. of Occupational Analysis, WMC. Factor analysis 
of occupational aptitude tests. Educ. psychol. Measmt., 
1945, 5, 147-155; (c) GATB Senior Project Staff, 
University of Utah, et al. General Aptitude Test Battery 
patterns for college areas. Occupations, 1951, 29, 518- 
526; and (d) Petrullo, L., Cohen, I. K., and Meigh, C. 
The Employment Service testing program. Emplmt. 
Secur. Rev., 1949, 16, 19. 










Table 4 


Means, Standard Deviations, and Validities for the 
General Aptitude Test Battery on the 1947 
Utah Class (N = 49) 


                                      Standard     Validity Coefficient 
GATB Score                  Mean      Deviation     5 Qtr.     4 Yr. 

Intelligence (G)           143.0       11.9          .47        .54 
Verbal (V)                 137.6       14.4          .45        .42 
Numerical (N)              132.6       12.8          .39        .53 
Spatial (S)                128.0       10.7          .41        .37 
Form Perception (P)        126.3        3.7          .12        .11 
Clerical Perception (Q)    123.0       20.           .14        .19 
Test “E”                   129.1    [illegible]     −.06        .10 
Aiming (A)                 107.2        3.8         −.15       −.04 
Motor Speed (T)             98.9    [illegible]      .01        .13 
Finger Dexterity (F)        97.5    [illegible]     −.01        .13 
Manual Dexterity (M)       109.8    [illegible]      .06        .05 

students were then followed up through 
graduation and new validities computed against 
the grade point average for the total four-year 
training program. These results are also 
presented in Table 4. 

The method of testing persons who have 
completed training and on whom criterion 
scores of success are already available is often 
utilized to make a rapid evaluation of apti- 
tudes. When this method is used, as in the 
present study on the GATB (particularly in 
the case of the five-quarter criterion), it is 
highly advisable to conduct additional studies 
in order to check the results on the initial 
study. These studies should preferably be of 
the follow-up type in which persons would be 
tested prior to medical training (but not selected 
on the basis of these experimental test results) 
and then followed up to ascertain their even- 
tual degree of success in training. 

As in the case of the MCAT it appeared 
wise in searching for valid aptitudes to use a 
multiple evaluation approach in examining 
the results for the GATB. The things 
considered for each aptitude were the mean, 
standard deviation, and validity coefficients 
together with a judgment of whether or not 
it makes psychological sense to identify that 
aptitude as important in medical academic 
success. In terms of this multiple evaluation, 
the first four aptitudes in Table 4, namely, 
General Intelligence (G), Verbal (V), Numer- 


ical (N), and Spatial (S), were considered 
sufficiently valuable to warrant further serious 
consideration, whereas the other aptitudes 
were judged to be either of borderline value 
or of no value with regard to success in medical 
training. 


Discussion 


A reduced battery yielding four aptitude 
scores, G, V, N, and S, can be administered in 
less than 45 minutes, of which 29 minutes is 
actual testing time. This time is 1/7 as long 
as the total testing time of 6 hours and 45 
minutes for the current MCAT. Even though 
restriction of range was strongly evident on 
all these GATB aptitudes, a multiple correla- 
tion coefficient of .56 was obtained for this 
reduced battery against the five-quarter criterion. 
A higher multiple correlation coefficient 
of .60 was found for the 4-aptitude battery 
against the four-year criterion. 

From the above results it appears that the 
four aptitude combination competes favorably 
with the MCAT both in predictive value and 
in testing time required. Although no direct 
comparison is possible because of different 
standardization populations it can be seen 
from all the results presented that there is some 
restriction of range (with subsequent effect 
on the size of validities) on the four GATB 
aptitudes as well as on the MCAT subtests. 
At the same time it should be noted that the 
premedical science grade point average has 
often been found to be the best, or one of the 
best, predictors of medical academic success. 
It appears likely that these results were 
obtained despite the handicap of restriction 
of range resulting from the important role the 
premedical science grade point average plays 
in many selection programs. 

In the 1949 form of the MCAT only one 
Verbal Ability score was given so that the 
profile contained 5 instead of 8 subscores. 
One wonders what the correlation would be 
between this single Verbal Ability score and 
the four verbal scores previously reported on 
the MCAT. Is it the same as the composite 
score or is it identical to one of the Vocabulary 
subtests? On the surface, this Verbal Ability 
is apparently somewhat different in composi- 
tion from the previous composite Verbal 
Ability score and from the three MCAT 
Vocabulary scores that have been evaluated 
here. The latest Verbal Ability score is 
described as a composite taken from a vocab- 
ulary section and a reading comprehension 
section. 

More recently the Index of General Ability 
score has been dropped, leaving only 4 scores 
in the MCAT profile: Verbal Ability, Quantita- 
tive Ability, Modern Society, and Science. 
The number of scores in the profile was 
reduced for the sake of simplicity, and in this 
simplification process, most of the subtests 
that were poor predictors of the present 
criterion, medical academic success, were 
eliminated. The Modern Society subtest is 
the only subtest in the current MCAT that is 
primarily symbolic of the need for other well 
defined criteria of medical success. However, 
if not much attention is paid to this subscore 
in the actual selection program, then it is not 
playing its designated role well. A way of 
more certainly insuring that all medical 
doctors have a prescribed knowledge of 
modern society would be to require an appro- 
priate training course instead of having 
applicants take a test, the results of which 
might, in practice, be somewhat ignored in the 
complex medical selection program. 

The relationship between all types of scores 
found to have significant scholastic predictive 
value, such as the premedical science grade 
point average, the four aptitude scores, and 
certain subtests still in the MCAT, should be 
investigated. Of particular interest would be 
the relationship between the GATB Verbal 
Aptitude score and the most recent Verbal 
Ability score. Unfortunately, these two sets 
of scores have not as yet been obtained on the 
same sample. Furthermore, the particular 
combination of all these scores that will yield 
the maximum validity should be determined 
so that the best composite battery of valid 
measures can be available for use in predicting 
scholastic success. 

It is very likely that the best combination 
of the previously mentioned predictors 
would still leave a sizable fraction of medical 
academic success untouched. This would 
undoubtedly be also true for any other sug- 
gested criterion of medical success. Further 
research is therefore clearly needed to inves- 
tigate the value of other parts of medical 
selection programs and to develop devices and 
procedures that will get at additional character- 
istics important in total medical success. It 
is also suggested that any new devices and/or 
procedures be thoroughly evaluated by means 
of well designed studies before they are widely 
installed. 


Received May 25, 1951. 





Faking Personality Test Scores in a Simulated Employment Situation 


Alexander G. Wesman 


The Psychological Corporation, New York City 


It has been the experience of most industrial 
psychologists that personality and interest 
inventories are ineffective when used for 
selection purposes (1, 2, 3, 4, 6, 7, 8). Ordinarily, 
many of the items can be seen through by 
most applicants, and the appropriate response 
given. The stereotypes which many employ- 
ment officers seek (e.g., aggressive, self-confi- 
dent salesmen) are also the stereotypes which 
the applicant expects the employer to be 
seeking. He is therefore all too likely to 
respond accordingly. 

The data reported herein were collected in 
the course of a teaching demonstration. The 
author wished to impress a group of extension 
students at a large university with the untrust- 
worthiness of personality inventories in 
employee selection. He gave the Bernreuter 
Personality Inventory to a group of 85 students 
with about the following instructions: 


“I want you to pretend that you are applying 
for the position of salesman in a large industrial 
organization. You have been unemployed for 
some time, have a family to support, and want 
very much to land this position. You are being 
given this test by the employment manager. 
Please mark the answers you would give.” 


The following week, at the start of class, the 
same inventory was again distributed to the 
class, with the following instructions: 


“You are now applying for the position of 
librarian in a small town. You need the em- 
ployment to support your family and meet 
financial obligations. Please mark the answers 
you would give.” 

Both administrations of the inventory 
occurred before there was any discussion of 
the field of personality measurement. The 73 
students who took the test twice were a very 
heterogeneous group in age, academic back- 
ground, industrial experience, and test sophis- 
tication. On the latter variable, they ranged 
from a young lady taking her first course since 
high school, with almost complete innocence 
of the test field, to a young man about to 
receive a Ph.D. in measurement, with several 
years of professional experience behind him. 

Table 1 


Students’ Scores on a Self-Confidence Scale in 
Two Simulated Employment Situations 


[Frequency distribution of raw scores* on the Self-Confidence Scale for the two employment situations, Salesman and Librarian. Score intervals run in steps of 20 points from 260-241 down to 20-1 among the minus values and from 0-19 up to 280-299 among the plus values. The cell frequencies are illegible in the source; each column totals 73.] 


* Minus scores represent greater self-confidence. 


Table 1 presents the score distributions 
obtained from these two administrations of the 
inventory for one of the measured traits, 
Self-Confidence (Scale F-1) (5). The table 
speaks eloquently for itself. If one saw these 
distributions without foreknowledge of how 
they were obtained, he could only conclude 
that they represented two quite different 
groups of people. The first column, “Sales- 
man,” is apparently composed of people who 
are, with three exceptions, above average 
in self-confidence. The second group, “Librar- 
ian,” seems to contain almost as many below- 
average people on this trait as above-average 
(34 and 39, respectively). Those at the fifth 
percentile of the first group are more self- 
confident than the “applicants” at the fiftieth 
percentile of the second group. It is hard to 
realize that these “two” groups are really one 
and the same, except that the positions for 
which they are pretending to apply are 
different. 

The demonstration is, of course, artificial. 
These are not true applicants. They are 
students pretending that they are applicants. 
Unquestionably, some of them are more test- 
wise (and stereotype-wise) than the average 
real applicant. Nonetheless, the demonstra- 
tion seems to the author sufficiently dramatic 
to point up the susceptibility to faking of 
personality inventories in the industrial situations. 
Teachers who have not already used 
similar demonstrations with their students 
will find this approach rewarding. 


Received June 4, 1951. 


References 


1. Benton, A. L., and Kornhauser, G. I. A study of 
“score faking” on a medical interest test. J. 
Ass. Amer. Med. Coll., 1948, 23, 57-60. 

2. Bordin, E. S. A theory of vocational interests as 
dynamic phenomena. Educ. psychol. Measmt., 
1943, 3, 49-65. 

3. Cofer, C. N., Chance, June, and Judson, A. J. A 
study of malingering on the MMPI. J. Psychol., 
1949, 27, 491-499. 

4. Ellis, A. The validity of personality questionnaires. 
Psychol. Bull., 1946, 43, 385-440. 

5. Flanagan, J. C. Factor analysis in the study of 
personality. Stanford: Stanford University Press, 
1935. Pp. 103. 

6. Hunt, H. F. The effect of deliberate deception on 
Minnesota Multiphasic Personality Inventory 
performance. J. consult. Psychol., 1948, 12, 396- 

7. Longstaff, H. P. Fakability of the Strong Interest 
Blank and the Kuder Preference Record. J. 
appl. Psychol., 1948, 32, 360-369. 

8. Paterson, D. G. Vocational interest inventories in 
selection. Occupations, 1946, 25, 152-153. 








The Relationship Between Ortho-Rater Tests of Acuity and 
Color Vision in a Senescent Group 


Robert W. Kleemeier 
Moosehaven Research Laboratory, Orange Park, Florida 


In a recent report on Ortho-Rater norms 
and sex differences Ely, Kephart, and Tiffin 
(2) noted that a sample of 7,597 male and 
2,457 female industrial employees showed an 
unexpected difference in color vision scores. 
These authors say, “It will be noted that in the 
color vision test a difference in favor of the 
men was found. This difference was signif- 
icant at the 1% level. The authors are aware 
that this finding is contrary to long accepted 
theories and facts about the distribution of 
color blindness among the sexes. The explana- 
tion for this difference in findings is not 
known.” 

Presented below is evidence gathered from 
tests administered to a group of aged male 
subjects, which, we believe, provides the 
explanation to the above mentioned dilemma. 
This evidence seems to indicate that the 
answer lies not in the realm of color vision, but 
rather stems from the fact that in the industrial 
sample studied women had significantly poorer 
distance acuity than men. Thus, their poorer 
performance on the color test is, perhaps, 
simply a reflection of their poorer visual acuity. 


Method 


Subjects in our study were 128 male residents 
in a fraternal home for the aged. Table 1 
shows the age characteristics of this group. 
The tests were given as a part of a routine 
battery administered to residents of the home. 


Table 1 


Age Distribution of Subjects 


Group Age 
A 65-70 
B 71-75 
C 76-80 
D 81-85 


Total 


Far distance binocular acuity test scores and 
color test scores were available on 123 of the 
total group tested. In addition, paired near 
distance acuity and color scores were obtained 
on 127 of this group. 

All tests of visual performance were made on 
the Ortho-Rater under standard conditions 
(3). Our aim in giving these tests was to 
measure the quality of visual performance 
exhibited by the subject at the time of testing; 
consequently, subjects who customarily wore 
corrections were tested while wearing them. 
At the completion of such tests, measures of 
visual acuity were obtained without correc- 
tions. 


Results 


A product-moment r of .675 was obtained 
between the Ortho-Rater color test (F-7) and 
the far distance binocular acuity test (F-3). 
Using z transformations, the 1% fiducial limits 
of this correlation are .786 and .523. N is, 
of course, 123. 

A somewhat lower but still significant r of 
.487 was obtained between results on the near 
acuity test (N-1) and the color test. The 1% 
fiducial limits of this r (N=127) are .646 and 
.295. Since the color test is given at the far 
distance this lower r with near acuity is to 
be expected. 

To round out the intercorrelational possibil- 
ities presented here, we find an r of .565 
between the acuity tests at the two distances. 
With an N of 85, the 1% fiducial limits for 
this r are .335 and .724. 
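
The fiducial limits quoted above can be approximated with the z transformation. The sketch below is an illustration rather than the author's own computation: it assumes the usual standard error of 1/sqrt(N − 3) for z and a two-sided 1% level, and the function name `fisher_limits` is introduced here only for the example. Under these assumptions each limit comes out within about .01 of the value reported in the text.

```python
import math
from statistics import NormalDist


def fisher_limits(r, n, level=0.99):
    """Approximate limits for a correlation r from a sample of size n.

    Assumes the z transformation (atanh) with standard error 1/sqrt(n - 3).
    """
    z = math.atanh(r)                      # transform r to z
    se = 1.0 / math.sqrt(n - 3)            # assumed standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.tanh(z - crit * se)       # transform back to r
    upper = math.tanh(z + crit * se)
    return lower, upper


# The three correlations reported in the text, with their Ns.
for r, n in [(0.675, 123), (0.487, 127), (0.565, 85)]:
    lower, upper = fisher_limits(r, n)
    print(f"r = {r:.3f}, N = {n}: {lower:.3f} to {upper:.3f}")
```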

Because of the relatively poor visual acuity 
in our group, the relationship between perfor- 
mance on acuity and color tests was imme- 
diately obvious to the examiner. Those who 
had great difficulty with the test objects in 
the acuity tests regularly exhibited difficulty 
not only with the color test objects but with 
all other visual tests in the battery. It was 


114 





Relationship between Ortho-Rater Test and Color Vision 


this observation which led us to correlate 
visual acuity and color. 

In Figure 1 the median far distance binocular 
visual acuity scores for the four age groups 
shown in Table 1 are given. It will be noted 
that the senescent group has considerably 
poorer performance than the younger industrial 
group (2). Thus, on the Ely, Kephart, and 
Tiffin norms, the median scores for our four 
age groups on Test F-3 would be as follows: 
(A) eighth, (B) sixth, (C) third, (D) second 
percentile. These scores show somewhat 
dramatically the amount of deterioration which 
has taken place in the visual acuity of this 
particular senescent group. It seems, how- 
ever, that the visual performance of the great 
majority of these men is adequate for the 
demands made upon them. 


Discussion 


In view of our findings, the explanation of 
the poor visual performance of the women in 


[Figure 1: Ortho-Rater score (F-3) plotted against age group (A to D).] 


Fig. 1. Median far distance binocular visual acuity 
scores for age groups shown in Table 1. Quartiles 
indicated by dotted lines. For an Ortho-Rater score 
of 10 the equivalent Snellen notation is 20/20. 


the industrial group on the color vision test 
seems obvious. Ely, Kephart and Tiffin note 
that the mean score for males on the color test 
(F-7) was 5.08 and the mean for females was 
4.68 on this test. They also show that the 
mean score for far distance binocular acuity 
(F-3) for their male population was 10.69 and 
for the female population was 9.64. This 
difference of 1.05 is significant at the 1% level. 
In view of our finding of a correlation of .675 
between far distance binocular acuity and the 
color perception test, it is not at all surprising 
that the women in this particular industrial 
sample scored lower than the men on Test F-7 
(color). Thus, it would seem that the major 
reason for their lower score was not a defi- 
ciency in color perception but rather a 
deficiency in visual acuity. They simply 
couldn’t see the color chart as well as could 
the men. 

These results also have bearing upon an 
observation made by Boice, Tinker and Pater- 
son (1) who obtained in a small male sample 
(N=40), age 60 years or older, an unusually 
high percentage of color blindness (20%). 
This evidence, they state, suggests the possibil- 
ity “. . . that, with advanced age, changes in 
the retina, the optical nerve or the visual 
cortex occur in an unusually high percentage 
of cases.” Here, too, the factor of visual 
acuity needs control before we speculate too 
much upon the existence of a special deteriora- 
tion of color vision with age. 

Tiffin (4, p. 225) has also noted a diminution 
of color sensitivity with age. Using Ortho- 
Rater data gathered on an industrial sample 
of over 10,000 men and women, he observed 
that “. . . after age 45 both sexes show a loss 
in color vision. In an earlier report .. . it 
was shown that decreases in color vision 
began by age 25. Both studies agree that 
color vision deteriorates with advanced age.” 

In view of possible contamination of these 
results with uncontrolled visual acuity, these 
reported age trends in color vision are open to 
question. Thus, it would appear that any 
attempt to ascertain the relationship between 
color vision and age can be successful only if 
visual acuity is somehow controlled. This is 
particularly true when pseudo-isochromatic 
color tests such as the Ishihara or the Ortho-Rater 
are used. 


Received January 11, 1952. 
Early publication. 


References 


1. Boice, Mary L., Tinker, M. A., and Paterson, D. G. 
Color vision and age. Amer. J. Psychol., 1948, 
61, 520-526. 

2. Ely, J. H., Kephart, N. C., and Tiffin, J. Ortho- 
Rater norms and sex differences. J. appl. 
Psychol. [remainder of citation illegible] 

3. Standard practice in the administration of the Bausch 
& Lomb occupational vision tests with the Ortho- 
Rater. Rochester, N. Y.: Bausch and Lomb, 1944. 

4. Tiffin, J. Industrial psychology. (2nd Ed.) New 
York: Prentice-Hall, Inc., 1947. 








Note on Table for Use With Spearman-Brown Formula 


Lee W. Cozan 
Hechinger Company, Washington, D. C. 


In order to facilitate the use of the Spearman- 
Brown prophecy formula, the writer has 
prepared a table that shows the effects of 
increasing the number of independent measurements 
upon the reliability coefficient.¹ The 
table is simple to use. The table is entered 
vertically by the original reliability coefficient 
and horizontally by the number of times the 
measure is increased. 

For example, if the reliability coefficient of 
a twenty-minute employment test is 0.50, 
increasing the length of the test to one hour 
should increase the reliability coefficient to 
0.75. If the reliability coefficient of performance 
ratings made by one supervisor is 0.75, 
the reliability of the pooled ratings of five 
raters should be 0.94. 
This table permits rapid and accurate determination 
of the reliability coefficient and 
eliminates all calculations previously involved 
in the application of the Spearman-Brown 
prophecy formula. 

¹ To reduce printing costs the table has been deposited 
with the American Documentation Institute. Order 
Document 3308 from American Documentation Institute, 
1719 N Street, N.W., Washington 6, D. C., remitting 
$1.00 for microfilm (images 1 inch high on 
standard 35 mm. motion picture film) or $1.00 for 
photocopies (6 × 8 inches) readable without optical aid. 

It is hoped that the applicability and 
utility of the table will be revealed by future 
research. 
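
The two worked examples in this note follow directly from the Spearman-Brown prophecy formula, k·r / (1 + (k − 1)·r), where r is the original reliability and k is the number of times the measure is increased. The short sketch below is offered only as an illustration of that formula; the function name `spearman_brown` is introduced here and is not taken from the deposited table.

```python
def spearman_brown(r, k):
    """Predicted reliability when a measure of reliability r is lengthened k times."""
    return k * r / (1 + (k - 1) * r)


# Twenty-minute test (r = 0.50) lengthened to one hour, i.e. k = 3.
print(round(spearman_brown(0.50, 3), 2))

# Ratings by one supervisor (r = 0.75) pooled over five raters, i.e. k = 5.
print(round(spearman_brown(0.75, 5), 2))
```

The function reproduces the values of 0.75 and 0.94 given in the examples above.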


Received June 1, 1951. 


Editor’s Note: At the page proof stage the Editor 
discovered to his mortification that a table and a 
nomograph for the Spearman-Brown Formula were 
published by Dunlap, J. W. and Kurtz, A. K., in 
Handbook of statistical nomographs, tables, and formulas, 
by the World Book Company in 1932. Had he been 
aware of this, this article would not have been accepted. 
— Editor. 








The Scaling of Stimuli by the Method of Successive Intervals * 


Allen L. Edwards 
The University of Washington 


We are sometimes faced in psychological 
research with the problem of ordering a set of 
stimuli or objects on a psychological continuum 
when the relative positions of the same stimuli 
on a physical continuum are unknown. 
Suppose, for example, that we have available a 
set of n stimuli. We assume that these stimuli 
possess varying but unknown degrees of some 
defined attribute. We wish to define opera- 
tionally a psychological scale for this attribute 
and to determine the values of the stimuli on 
the defined scale. 

Applying Thurstone’s (6, 7) well-known law 
of comparative judgment to data obtained by 
the method of paired comparisons provides 
one solution to the scaling problem. The 
method of paired comparisons, however, requires 
n(n−1)/2 judgments for the n stimuli. 
It is obvious that the method is experimentally 
impractical when the number of stimuli to be 
scaled is large. Twenty-five stimuli, for 
example, would require 300 comparative judg- 
ments from each subject. 
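
The growth in the number of judgments is easy to tabulate. The sketch below simply evaluates n(n−1)/2 for a few set sizes and prints it beside the single judgment per stimulus required by a rating method; the function name `paired_comparisons` and the set sizes other than 25 are chosen here for illustration.

```python
def paired_comparisons(n):
    """Number of distinct pairs among n stimuli: n(n-1)/2."""
    return n * (n - 1) // 2


# Judgments per subject: every pair versus one rating per stimulus.
for n in (10, 25, 50):
    print(f"n = {n}: {paired_comparisons(n)} paired judgments, {n} single ratings")
```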

Method of Successive Intervals 

In the present paper we shall describe an 
alternative method of scaling which possesses 
the following properties: (1) the method re- 
quires but a single judgment from each subject 
for each stimulus; (2) the method yields 
scale values which are linearly related to those 
obtained by the method of paired comparisons; 
(3) the method provides its own internal con- 
sistency check upon the validity of the various 
assumptions made; and (4) the computations 
involved are quite simple. The theoretical 
development of this method of scaling, which 
we shall call the method of successive intervals, 
has been described elsewhere (2). 

The basic data are obtained in the form of 
judgments or ratings of each stimulus in terms 
of successive intervals or categories representing 
increasing amounts of the defined attribute. 
No assumption, such as that involved in the 
method of equal-appearing intervals (8), is 
made concerning the widths of the successive 
intervals. The only requirement is that each 
successive interval represent an unknown but 
additional amount of the attribute. 

* This paper was prepared while the writer was a 
post-doctoral Research Training Fellow of the Social 
Science Research Council. 

It is in the nature of the scaling problem to 
determine the widths of the intervals making 
up the psychological continuum. We make 
the assumption that the judgments for each 
stimulus are normally distributed on the 
unknown psychological continuum. The scale 
values of the stimuli are then defined as the 
means of the distributions of judgments as 
projected upon the psychological continuum. 

For purposes of illustration, we shall use 
data reported by Saffir (5).¹ In Saffir’s study, 
subjects judged the extent to which they would 
like to associate with various nationalities. 
Ten rating categories were used. We have 
rearranged Saffir’s data so that the first 
category represents nationalities which the 
subjects would least prefer to associate with 
and the last category represents nationalities 
which the subjects would most prefer to 
associate with. From the frequency distribu- 
tions of ratings, we obtain the cumulative 
distributions of Table 1. 

The matrix of Table 1 is of order n × r where 
n is the number of stimuli and r is the number 
of categories. Let the general element of this 
matrix be p_jk. Any element p_jk will then show 
the proportion of subjects placing a given 
stimulus j in the kth category or below. The 
values 1 − p_jk will show the proportion of 
subjects placing stimulus j above the kth 
category. All subsequent calculations are 
based upon the data of Table 1. They can 
be described in terms of a series of matrices. 
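
As a small illustration of how one row of such a matrix is formed, the sketch below converts a frequency distribution of ratings into cumulative proportions p_jk. The counts are invented for the example (they are not Saffir's data) and were chosen only so that they sum to N = 133; the function name `cumulative_proportions` is likewise an assumption of this sketch.

```python
from itertools import accumulate


def cumulative_proportions(frequencies):
    """Proportion of judgments falling in each category or below."""
    total = sum(frequencies)
    return [running / total for running in accumulate(frequencies)]


# Hypothetical counts for one stimulus over r = 10 categories (sum = 133).
freqs = [1, 0, 2, 4, 8, 28, 28, 29, 21, 12]

p = cumulative_proportions(freqs)          # p_jk for k = 1..10
above = [1 - value for value in p]         # 1 - p_jk, proportion above category k

print([round(value, 2) for value in p])
```

The last entry of each row is necessarily 1.00, so a row carries r − 1 independent values.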

Table 1 


Cumulative Distributions of Judgments for Nationality Preference Data* (N = 133) 


[Rows: 1. Austrian, 2. Belgian, 3. Frenchman, 4. German, 5. Greek, 6. Hollander, 7. Irishman, 8. Italian, 9. Japanese, 10. Mexican, 11. Negro, 12. Norwegian, 13. Pole, 14. Scotchman, 15. S. American, 16. Spaniard, 17. Swede. Columns: rating categories 1 (Least Preferred) through 10 (Most Preferred). The cumulative proportions are illegible in the source; every entry in column 10 is 1.00.] 


* After Saffir (5). 


The scale values of the stimuli are unknown. 
Assuming, however, that the distributions of 
judgments are normal on the psychological 
continuum, the boundaries of the categories 
can be expressed as normal deviates. If the 
table of the normal probability curve is 
entered with the value 1 − p_jk, the corresponding 
normal deviate will be the upper limit of 
the kth category (or the lower limit of the 
(k+1)th category). The first stimulus, Austrian, 
for example, provides estimates of the upper 
limits of categories 4, 5, 6, 7, 8, and 9.² 
Expressed as normal deviates, these boundaries 
are −1.64, −1.23, −.47, .08, .67, and 1.34, 
respectively. 

Each stimulus will provide an estimate of one 
or more boundaries. These estimates make up 
the X_jk matrix which, of necessity, cannot be 
of order larger than n × (r − 1). A stimulus 
whose frequencies are distributed over all r 
categories, for example, may provide estimates 
of r − 1 boundaries.³ It is important to note 
that the X_jk values can be obtained without 
any reference to the precise location of the 
scale values of the stimuli. 

Since the cell entries of the X_jk matrix 
correspond to upper limits of the kth intervals (or 
the lower limits of the (k+1)th intervals), the 
differences X_j(k+1) − X_jk will provide estimates 
of the widths of the successive intervals. 
For the first stimulus, Austrian, these successive 
differences are .41, .76, .55, .59, and .67. 
These are estimates of the widths of intervals 
5, 6, 7, 8, and 9, respectively. Obtaining the 
similar differences for each of the other stimuli, 
we have a matrix in which the entries of each 
column are estimates of a common interval. 
We assume that the best estimate of the 
interval width is given by the mean of the 
column entries.⁴ The obtained means are 
.38, .40, .42, .41, .45, .52, .78, and 1.04. They 
represent the widths of intervals 2, 3, 4, 5, 6, 
7, 8, and 9, respectively. Cumulating the 
means for the successive intervals, we have 
the common psychological continuum for all 
stimuli. 

¹ For experimental reasons five of the nationalities 
rated in Saffir’s study are not reported upon here. 
Three additional nationalities could not be scaled by 
the technique described. The reason for this will be 
discussed later. 

² In determining the boundaries of the categories, 
values of 1 − p_jk greater than .95 and less than .05 may 
be ignored. Such values would be determined by only 
a small number of observations and are regarded as 
unreliable. 

³ No estimate can be obtained of the upper boundary 
of the rth category and no estimate can be obtained of 
the lower boundary of the first category. If more than 
50 per cent of the judgments for a given stimulus fall in either 
of these categories, the stimulus cannot be scaled by 
the method described. It is for this reason that the 
three nationalities mentioned earlier were omitted. 

⁴ The calculations up to this point are the same as 
those described by Attneave (1) for his method of 
graded dichotomies. 


Table 2 


Theoretical Cumulative Distributions Obtained from Scale Values and Interval Widths 


Scale Values of Nationalities 

(2.51)    1. Austrian 
(3.04)    2. Belgian 
(3.48)    3. Frenchman 
(3.99)    4. German 
(1.29)    5. Greek 
(3.16)    6. Hollander 
(3.89)    7. Irishman 
(2.07)    8. Italian 
( .77)    9. Japanese 
(1.06)   10. Mexican 
( .07)   11. Negro 
(3.23)   12. Norwegian 
(1.69)   13. Pole 
(4.07)   14. Scotchman 
(1.83)   15. S. American 
(2.21)   16. Spaniard 
(3.35)   17. Swede 

[The theoretical cumulative proportions in the body of the table are illegible in the source.] 



With knowledge of the psychological con- 
tinuum, it is a simple matter to find the scale 
values of the stimuli. In terms of our earlier 
discussion, they will be the medians of the 
distributions of judgments as projected upon 
the psychological continuum. They may be 
computed by formula, interpolating within 
a specified interval to find the point below 
which and above which 50 per cent of the 
judgments fall. 
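
The interpolation can be sketched as follows, assuming linear interpolation within the interval that contains the median (the usual formula for the median of grouped data; the text does not spell the formula out). The function name and the sample proportions are hypothetical and are not taken from Table 1.

```python
def scale_value(cum_props, upper_limits):
    """Median of a judgment distribution projected on the continuum.

    cum_props[k]    -- cumulative proportion of judgments through interval k+1
    upper_limits[k] -- upper limit of interval k+1 on the continuum
    """
    for k in range(1, len(cum_props)):
        below, upto = cum_props[k - 1], cum_props[k]
        if below < 0.50 <= upto:
            lower, upper = upper_limits[k - 1], upper_limits[k]
            # Linear interpolation within the interval holding the median.
            return lower + (0.50 - below) / (upto - below) * (upper - lower)
    raise ValueError("median falls in an open-ended extreme interval")


# Boundaries from cumulating the mean widths (zero origin assumed).
limits = [0.00, 0.38, 0.78, 1.20, 1.61, 2.06, 2.58, 3.36, 4.40]
# Hypothetical cumulative proportions for one stimulus.
props = [0.01, 0.02, 0.04, 0.10, 0.18, 0.33, 0.53, 0.80, 0.97]
value = scale_value(props, limits)
print(round(value, 2))
```

A stimulus whose median lies in the first or last interval raises an error here, in line with the restriction that such stimuli cannot be scaled by the method described.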

Internal Consistency Check 

We have placed no restrictions upon the 
distributions of Table 1, other than that the 
entries in the last column must equal 1.00. 
We thus have n(r-1) = 17(10-1) = 153 independent
entries in the table. We have
available the n = 17 scale values and the
r-2 = 8 interval widths, or a total of 25
parameters. If the assumptions we have 
made are tenable, it should now be possible to 
reproduce the 153 empirical values from the 


25 parameters—within a specified margin of 
error. 


Most Preferred


Cumulative Interval Widths


1.20 1.61 2.06 2.58 3.36 4.40


   4    5    6    7    8    9

09 18 33 ae i 97 
08 16 

1 : 08 


46 62 78 
06 14 
00 
32 
67 80 .90 
56 71 84 
87 94 .98 
02 05 12 
31 AT 64 
01 .02 
.26 Al 59 
.27 44 
02 04 10 


At the left of Table 2 we show the scale 
values of the stimuli upon the common 
psychological continuum. At the top of the 
table we have reproduced the psychological 


Table 3 


Distribution of Discrepancies Between Observed and 
Theoretical Values of Table 1 and Table 2 


Discrepancies
 .06
 .05
 .04
 .03
 .02
 .01
 .00
-.01
-.02
-.03
-.04
-.05
-.06
-.07
-.08
-.09
-.10







continuum. If we now subtract the scale 
values of the stimuli from the cumulative 
interval widths, we shall have a matrix of 
theoretical normal deviates $X'$. The $X'_{jk}$
values will be the boundaries of the successive 
intervals as expressed in terms of normal 
deviates from the scale values projected upon 
the psychological continuum. The entries of 
this matrix for the first stimulus, Austrian,
for example, would be —2.51, —2.13, —1.73, 
—1.31, —.90, —.45, .07, .85, and 1.89. These 
values would correspond to the upper limits 
of the intervals 1, 2, 3, 4, 5, 6, 7, 8, and 9, 
respectively, on the psychological continuum 
for the first stimulus. From a table of the 
normal probability curve, it is now possible to 
determine the corresponding proportion of 
judgments falling below each of the successive 


intervals. These values are the cell entries 
of Table 2. 
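
The computation of one row of Table 2 can be sketched in Python. The boundaries below come from cumulating the mean widths given earlier from an assumed origin of zero, and 2.51 is the scale value reported for Austrian; the function names are hypothetical. The standard-library error function stands in for the printed table of the normal curve, so a proportion may differ from a table lookup by .01.

```python
from math import erf, sqrt

# Upper limits of intervals 1..9: cumulated mean widths, zero origin assumed.
BOUNDARIES = [0.00, 0.38, 0.78, 1.20, 1.61, 2.06, 2.58, 3.36, 4.40]


def normal_cdf(x):
    """Proportion of the unit normal curve falling below deviate x."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def theoretical_row(scale_value, boundaries=BOUNDARIES):
    """Return (normal deviates, cumulative proportions) for one stimulus."""
    # Deviate of each boundary from the stimulus scale value.
    deviates = [round(b - scale_value, 2) for b in boundaries]
    proportions = [round(normal_cdf(d), 2) for d in deviates]
    return deviates, proportions


deviates, proportions = theoretical_row(2.51)  # Austrian
print(deviates)
print(proportions)
```

The deviates produced for Austrian are the nine values quoted in the text, from -2.51 to 1.89.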

The entries in each row of Table 2 are 
theoretical cumulative distributions. If the 
assumptions we have made are tenable, they 
should reproduce the empirical distributions of 
Table 1. If we make the matrix subtraction 
of Table 2 from Table 1, we shall have the 
discrepancies between our empirical and theo- 
retical values. The distribution of these 
errors is shown in Table 3. It can readily be 
determined that the absolute mean discrepancy 
is .021. This means that from our 25 para- 
meters we can reproduce the empirical distribu- 
tions of judgments with an average error of 
only .021. 
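
A minimal sketch of the discrepancy calculation follows, assuming the two tables are held as lists of rows of equal shape. The function name and the small tables are hypothetical; they are not the data of Tables 1 and 2.

```python
def mean_absolute_discrepancy(observed, theoretical):
    """Average |observed - theoretical| over all cells of two tables."""
    diffs = [
        abs(o - t)
        for obs_row, theo_row in zip(observed, theoretical)
        for o, t in zip(obs_row, theo_row)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical two-stimulus, three-interval tables (not the study's data).
observed = [[0.10, 0.35, 0.80], [0.05, 0.20, 0.60]]
theoretical = [[0.09, 0.33, 0.82], [0.06, 0.22, 0.57]]
mad = mean_absolute_discrepancy(observed, theoretical)
print(round(mad, 3))
```

Applied to the 153 independent entries of the two full tables, the same average would be the figure the text reports as .021.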

The mean discrepancy of .021 compares 
favorably with the values usually reported for 


Fig. 1. Scale values obtained by the method of paired comparisons and by the method of successive intervals. [Axes: paired comparisons; successive intervals.]



the internal consistency check applied to 
paired comparison data. Guilford (3, p. 231), 
for example, reports an average error of .027, 
Hevner (4) an average error of .024, Thurstone 
(9) an average error of .029, and Saffir (5) a 
value of .031 for paired comparison data. 

In Figure 1 we have plotted the scale values 
obtained by the method of paired comparisons, 
as reported by Saffir, against those obtained 
here by the method of successive intervals. 
It is obvious that the relationship is linear and 
that the scatter is relatively small. 

We mentioned earlier that the distributions 
of judgments for five nationalities were not 
used in determining the psychological continuum.
These five nationalities were omitted
for experimental reasons. We wanted to see 
if the scale values obtained by projecting the 
distributions of judgments for these five 
nationalities upon the psychological continuum 
would be consistent with the scale values of 
the other stimuli. The plotted points for 
these five nationalities are shown in Figure 1 
as small circles. It seems evident that their 
scale values are consistent with those obtained 
for the other 17 stimuli and with the corre- 
sponding values obtained by the method of 
paired comparisons. 


Summary


The method of successive intervals can be 
applied to any number of stimuli. Only n
judgments for n stimuli are required from each
subject in contrast with the n(n-1)/2 judgments
required in the method of paired comparisons.
Yet the scale values obtained by the
method of successive intervals are shown to be 
linearly related to those obtained by the method 
of paired comparisons. Furthermore, the meth- 
od of successive intervals, like the method of 
paired comparisons, provides its own internal 
consistency check. The average error in 


reproducing the empirical data from a limited 
number of parameters is shown to be compara- 
ble to the values reported for the method of 
paired comparisons. 


Received May 31, 1951. 


References 


1. Attneave, F. A method of graded dichotomies for
the scaling of judgments. Psychol. Rev., 1949,
56, 334-340.

2. Edwards, A. L. Psychological scaling by means of
successive intervals. Psychometric Laboratory
Report No. 69, May, 1951. Univ. Chicago.

3. Guilford, J. P. Psychometric methods. New York:
McGraw-Hill, 1936, p. 231.

4. Hevner, Kate. An empirical study of three psychophysical
methods. J. gen. Psychol., 1930, 4,
191-212.

5. Saffir, M. A. A comparative study of scales constructed
by three psychophysical methods. Psychometrika,
1937, 2, 179-198.

6. Thurstone, L. L. Psychophysical analysis. Amer.
J. Psychol., 1927, 38, 368-389.

7. Thurstone, L. L. A law of comparative judgment.
Psychol. Rev., 1927, 34, 273-286.

8. Thurstone, L. L., and Chave, E. J. The measurement
of attitude. Chicago: Univ. Chicago Press,
1929.

9. Thurstone, L. L. Unpublished study of food preferences.





Paired Comparison Ratings. I. The Effect on Ratings of
Reductions in the Number of Pairs


Ernest J. McCormick


Occupational Research Center, Purdue University


and


John A. Bachus


The Kroger Company, Cincinnati, Ohio


There has been rather general agreement 
that the paired comparison system is a 
satisfactorily reliable method of obtaining 
relative judgments in various situations, 
including employee rating. Its limited use in 
employee rating (as well as in other situations), 
however, probably in large part is attributable 
to the fact that it is time consuming and is 
fatiguing to the judges if there are very many 
individuals (or other stimuli) to be judged.¹

It was the purpose of this investigation to 
determine the extent to which it would be 
possible, in paired comparison ratings of 
employees, to use reduced numbers of pairings 
and still achieve essentially the same rating 


results as would be obtained from a complete 
pairing of all individuals within the group. 


Experimental Procedure 


Employees Rated. Through the cooperation 
of a manufacturing company two independent 
groups of 50 employees each were rated by 
their respective foremen. The individuals in 
Group I, consisting entirely of women, worked 
in the assembly department and were engaged 
in the task of assembling the small parts of 
electric meters. The individuals in Group 
II, consisting of 48 women and 2 men, worked 
in the machine department and were engaged 
in forming and finishing small parts to be used 
in electric meters. 

Preparation of Pairs for Rating. A complete 
pairing of each of the individuals in each 
group with every other individual results in 
1,225 pairs. An IBM card was punched for 


¹ The number of pairs increases greatly with increasing
N’s. The total number of pairs, where each stimulus
is paired with every other one, is $N(N-1)/2$, where N
is the number of stimuli.




each of the 50 individuals in each group; by 
other special machine methods cards were 
prepared for all 1,225 pairs for each group.²
Through mechanical methods the names of the 
two individuals in each pair were printed on 
the top edge of the card. Each person’s name 
appeared on the right side and on the left 
side of the cards respectively in about half of 
the pairs. 

Random numbers were also reproduced into 
the cards, and the cards for each group were 
then “sorted” by machine into random order 
before presenting them to the foremen who 
were to serve as raters. 

Rating of Employees. The cards were then 
presented to the foremen with typewritten 
rating instructions. These instructions pro- 
vided that the employees in each pair be 
judged in terms of the following question: 
“Which of these two employees is doing her 
(his) present job better?” For each pair, the 
foreman was asked to place a check mark be- 
side the name of the employee whom he judged 
to be the better. The foreman of each group 
rated only the members of his respective group. 

Performance Rating Indexes. On the basis of 
the judgments made by the raters, scale values 
were determined for all employees in each 
group. For this purpose performance rating 
indexes provided with the Personnel Comparison
System³ were used. This performance
rating index is determined on the basis of the 


² Appreciation is expressed to Dr. N. C. Kephart,
Occupational Research Center, Purdue University, for 
his assistance in developing the procedures for the prep- 
aration and subsequent processing of the IBM cards. 

³ The Personnel Comparison System, developed by
C. H. Lawshe and N. C. Kephart, is an employee rating 
system based on the paired comparison method. The 
Personnel Comparison System is available from Mayer 
and Company, 15 East 8th Street, Cincinnati, Ohio. 





Fig. 1. Illustration of matrix of pattern of partial
pairing; “x” identifies paired individuals. [Rows and columns: employee number.]


total number of individuals paired, and of the 
number of times an individual was chosen by 
the rater over other individuals. More specif- 
ically, a rating is based on the proportion of 
times an individual is preferred, converted 
to standard scores. The scale values tend 
toward a normal distribution and provide for 
a mean of 50 and standard deviation of 10. 
The rating indexes actually range from a low of 
23 to a high of 77. 
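
The conversion can be illustrated with a short Python sketch. It is not the index table supplied with the Personnel Comparison System, which is not reproduced here. It assumes that the proportion of times preferred is passed through the inverse normal curve and rescaled to a mean of 50 and a standard deviation of 10, with a half-count adjustment (also an assumption) so that proportions of 0 or 1 remain finite. The function name is hypothetical.

```python
from statistics import NormalDist


def rating_index(times_preferred, times_paired):
    """Illustrative standard-score index with mean 50 and SD 10.

    Assumption: the proportion preferred is converted through the inverse
    normal curve; the published index table may use another conversion.
    """
    # Half-count adjustment keeps the proportion strictly between 0 and 1.
    proportion = (times_preferred + 0.5) / (times_paired + 1)
    z = NormalDist().inv_cdf(proportion)
    return 50 + 10 * z


low = rating_index(0, 49)    # never preferred in a complete pairing of 50
high = rating_index(49, 49)  # always preferred
print(round(low), round(high))
```

Under these assumptions the indexes are symmetric about 50, so an employee preferred in k of n pairs and one preferred in n - k of n pairs receive indexes that sum to 100.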

Patterns for Pairing Individuals. Through 
an empirical approach, various “patterns” 
were developed for the “partial” pairing of 
each individual (i.e., an employee is paired with 
fewer than all the other employees in the 
group). These patterns, used for the partial 
pairing of the original groups of 50, provided 
for the pairing of each individual in the group 
with various numbers of other individuals, as 
follows (the letter identifies the pattern, and 
the number given is the number of pairs per 
individual for the pattern): A-40; B-35; C-32; 
D-28; E-25; F-21; G-17; H-13; I-9; and J-7. 

Four patterns were also used in the pairing 
of two groups of 30 individuals who had been 
randomly extracted from the original groups of 
50. Three of these patterns (A, E, and H) 
were patterns which had been used with the 
groups of 50, and which were also applicable 
to groups of 30. The other pattern (K) was 
specifically developed for use with the groups 
of 30 individuals. These patterns provided for 


the following numbers of pairs for each of the 
30 individuals: A-24; E-15; K-12; H-8. 
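
The total number of pairs implied by each pattern follows from the number of pairs per individual, since every pair involves two individuals. The Python sketch below works this out for the patterns listed above; the function and variable names are hypothetical.

```python
def total_pairs(n_individuals, pairs_per_individual):
    """Each pair involves two individuals, so halve the summed pairings."""
    total, remainder = divmod(n_individuals * pairs_per_individual, 2)
    if remainder:
        raise ValueError("N times pairs per individual must be even")
    return total


# Pairs per individual for each pattern, as listed in the text.
PATTERNS_50 = {"A": 40, "B": 35, "C": 32, "D": 28, "E": 25,
               "F": 21, "G": 17, "H": 13, "I": 9, "J": 7}
PATTERNS_30 = {"A": 24, "E": 15, "K": 12, "H": 8}

totals_50 = {p: total_pairs(50, k) for p, k in PATTERNS_50.items()}
totals_30 = {p: total_pairs(30, k) for p, k in PATTERNS_30.items()}
complete_50 = total_pairs(50, 49)  # complete pairing of 50 individuals
print(totals_50, totals_30, complete_50)
```

The complete pairing of 50 gives the 1,225 pairs cited earlier, and the pattern totals run from 1,000 for pattern A down to 175 for pattern J.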

The Character of the Patterns. These pat- 
terns, when worked into a triangular matrix, 
indicate which individuals shall be paired for 
rating. Such a matrix indicates the identifica- 
tion numbers of the employees to be paired, 
and is of course based on the assumption that 
the identification numbers have been assigned 
to the individuals in a random manner in so 
far as skill on the present job is concerned. 
Figure 1 shows, as an illustration, a completed 
matrix (Pattern D) for a group of 15 employees 
when each is paired with 8 other employees. 
An “x” at the intersection of any column and 
row indicates that the two employees repre- 
sented are to be paired when using this pattern. 

Any given pattern is suitable for use with
certain N’s, but not for use with other N’s.
A pattern is suitable for use with a given N
if it results in all individuals being paired with
an equal number of other individuals. For any
particular pattern, then, the combination of
pairs resulting for a given N will determine
whether or not each of the N individuals is
paired with an equal number of other individuals;
this in turn will determine whether the
pattern is or is not suitable for the N in
question.

The N’s with which a particular pattern 
can be thus used, however, increase in multiples 
of a constant for that pattern; for an N which
coincides with any such multiple there will 
result an equal number of pairs per individual. 
For any given pattern, this increase in multi- 
ples of a constant can be thought of as a 





Fig. 2. Segments of first columns of patterns (underscoring
shows beginning and ending of “rhythm”). [Headings: pattern; rhythm; number of pairs, N=50. Body: individuals with whom employee no. 1 is paired, indicated by “x”.]



“rhythm” for the pattern. Starting out with 
the smallest N for which a pattern results in
equal pairs per individual, it is possible to 
determine empirically the next greatest N 
for which there will also be equal pairs per 
individual. The difference between these two 
N’s is the size of the rhythm (in terms of 
individuals) for the pattern. The extension 
of the pattern to N’s that are increased by 
multiples of this constant will result in an 
equal number of pairs per individual for any 
such N.

In Figure 1, for example, the rhythm is 
complete at each of the broken lines. As 
presented, this pattern would be suitable for 
an N of 8 (with each person paired with 4 
others); an extension of the pattern to 15 also 
results in equal pairing (8 pairs per individual). 
The difference of seven (15—8) is the size of 
the rhythm for the pattern. This pattern 
would therefore be suitable for larger N’s 
which increase in multiples of seven, such as 
22, 29, 36, 43, 50, etc. This pattern, extended 
to accommodate 50 individuals, was one of 
those used in the investigation. 
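
The arithmetic of the rhythm can be sketched in Python. The function names are hypothetical, and the second function assumes that each added rhythm contributes the same number of pairs per individual as in the Figure 1 example (4 at an N of 8, 8 at an N of 15).

```python
def suitable_group_sizes(smallest_n, rhythm, limit):
    """Group sizes up to `limit` for which the pattern pairs everyone equally."""
    return list(range(smallest_n, limit + 1, rhythm))


def pairs_per_individual(n, smallest_n, rhythm, base_pairs):
    """Pairs per person when the pattern is extended from smallest_n to n.

    Assumes every added rhythm adds `base_pairs` pairs per individual.
    """
    steps, remainder = divmod(n - smallest_n, rhythm)
    if remainder:
        raise ValueError("pattern does not give equal pairing for this N")
    return base_pairs * (steps + 1)


sizes = suitable_group_sizes(8, 7, 50)
pairs_at_50 = pairs_per_individual(50, 8, 7, 4)
print(sizes, pairs_at_50)
```

Under that assumption the extension to 50 individuals gives 28 pairs per individual, the figure listed for pattern D.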

Figure 2 characterizes the several patterns 
used in the investigation. The heading of this 
figure shows, for each pattern, the size of the 
rhythm (in numbers of individuals), and the 
number of pairs resulting from the pattern 
where it was used with the N’s of 50 or the N’s
of 30, respectively. The body of the figure 
shows a segment of the first column for each 
of the patterns; the column for a given pattern 
identifies (by “x”) the individuals with whom
employee number one is paired for that pattern. 
The underscoring shows the points at which the 
rhythm of each pattern is complete; an exten- 
sion downward in the column of this same 
rhythm through the remaining individuals 
would then give a complete first column for the 
total of 50 or 30 employees depending on the 
group or groups with which the pattern was 
used. For any given pattern, then, knowing all 
of the individuals with whom employee number 
one is to be paired (column 1), it is only neces- 
sary to complete the triangular matrix by filling 
in the diagonals down toward the right, as 
shown in Figure 1, to identify all of the pairs 
in that pattern. 
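
The completion of the matrix from its first column can be sketched in Python. An “x” in row r of column 1 pairs employee 1 with employee r; copying it down the diagonal pairs each employee i with employee i + (r - 1). The first-column positions used below are hypothetical, chosen only so that a rhythm of 7 gives equal pairing; they are not the positions of any pattern in Figure 2, and the function names are likewise illustrative.

```python
def pattern_pairs(n, rhythm, base_offsets):
    """Enumerate the pairs of a partial-pairing pattern.

    base_offsets -- distances below employee 1, within one rhythm, at
                    which an "x" appears in the first column.
    """
    # Repeat the first-column marks every `rhythm` rows.
    offsets = [d for d in range(1, n) if (d - 1) % rhythm + 1 in base_offsets]
    # Copy each mark down its diagonal: employee i is paired with i + d.
    return [(i, i + d) for d in offsets for i in range(1, n - d + 1)]


def pairs_per_person(n, pairs):
    """Count how many pairs each employee appears in."""
    counts = {i: 0 for i in range(1, n + 1)}
    for a, b in pairs:
        counts[a] += 1
        counts[b] += 1
    return counts


HYPOTHETICAL_OFFSETS = {1, 2, 6, 7}  # assumed marks within a rhythm of 7
pairs_50 = pattern_pairs(50, 7, HYPOTHETICAL_OFFSETS)
counts_50 = pairs_per_person(50, pairs_50)
print(len(pairs_50), set(counts_50.values()))
```

With these assumed marks the pattern pairs every employee equally at N's of 8, 15, and 50, but not at an N, such as 10, that does not fall on the rhythm.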

Method of Deriving Rating Indexes for 
Various Patterns. It should be mentioned 




that in using these patterns the foremen were 
not required to re-rate the members of their 
groups. The cards containing the pairs re- 
quired for a given pattern were extracted from 
all the cards used in the initial complete pairing 
of each original group of 50. By this procedure
it was then possible to make an independent 
tally, for each pattern, of the number of times 
each employee was preferred over the others 
with whom he was paired in the pattern in 
question. For each pattern, performance 
rating indexes were then obtained for all 
employees in the group in essentially the same 
manner used in obtaining rating indexes based 
on all possible pairs. One modification of the 
procedure was necessary, however, in using 
the performance rating index table to derive 
rating indexes resulting from the various 


patterns of partial pairings; for each pattern, 
instead of entering the table for an N of 50
(or 30 in the case of the smaller groups), the 
“N” for a given pattern was considered to be 
the number of pairs per individual, for that 
particular pattern, plus one. 


Results 


The rating indexes obtained with each 
pattern for the employees of Group I and of 
Group II were correlated with the rating 
indexes obtained from the complete pairing. 
Similar correlations were computed for the 
smaller groups, Group III and Group IV. 
The resulting correlations are given in Table 1. 

The differences in the two correlations for 
each pattern were then subjected to tests of 
statistical significance. Such tests were made 
in order to ascertain whether differences in the 
two correlations could or could not reasonably 
be attributed to chance fluctuations. In 
making such tests the correlation coefficients 
for both groups were converted to Fisher’s z 
coefficients. For each pattern the difference 
between the z’s for the two groups was then 
determined. This difference was in turn 
divided by the standard error of the difference 
between the two coefficients, using the formula 
provided by Guilford (1, p. 224). The resulting
t ratios are presented in Table 1.
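
The test can be sketched in Python. The sketch assumes the usual standard error of the difference between two Fisher z coefficients from independent samples, the square root of 1/(N1 - 3) + 1/(N2 - 3); the formula in Guilford is not reproduced in this article, so this is an assumption, and the function name is hypothetical. The correlations used are those reported for pattern A in the two groups of 50.

```python
from math import atanh, sqrt


def z_difference_ratio(r1, n1, r2, n2):
    """Ratio of the difference between two Fisher z's to its standard error.

    Assumes independent samples and the standard error
    sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    """
    z1, z2 = atanh(r1), atanh(r2)  # Fisher's z transformation
    se_diff = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se_diff


ratio_a = z_difference_ratio(0.991, 50, 0.994, 50)  # pattern A, Groups I and II
print(round(abs(ratio_a), 2))
```

Under these assumptions the ratio for pattern A falls short of 1.96.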

It will be observed that none of the t ratios 
even approaches the 5 per cent confidence 
limits (1.96). Since none of the pairs of r’s 
differ significantly, it may be inferred that the 










Table 1


Correlations, for Two Independent Samples, of Scale
Values Resulting from Various Patterns of
Partial Pairings with Those Resulting
from Complete Pairing


Groups of 50

           Pairs Per    Total No.    Correlations
Pattern    Individual   of Pairs     Group I    Group II    t
A          40           1,000        .991       .994        ...
B          35             875        .992       .992        ...
C          32             800        .993       .987        ...
D          28             700        .980       .984        ...
E          25             625        .961       .971        ...
F          21             525        .960       .948        ...
G          17             425        .962       .949        ...
H          13             325        .935       .928        ...
I           9             225        .936       .885        ...
J           7             175        .858       .888        ...


Groups of 30

           Pairs Per    Total No.    Correlations
Pattern    Individual   of Pairs     Group III  Group IV    t
A          24             360        .996       .994        ...
E          15             225        .991       .979        ...
K          12             180        .961       .946        ...
H           8             120        .948       .898        ...

magnitudes of the various r’s cannot reasonably 
be attributed to chance fluctuations, and that 
they therefore presumably reflect the approx- 
imate degree to which ratings resulting from 
the respective patterns of partial pairings 
actually reproduce the ratings based on a 
complete pairing. 

Ratings from Complete Versus Partial Pair- 
ings for Groups of Fifty. It will be observed in 
Table 1 that the correlations for the two groups 
of 50 ranged from .991 and .994 for pattern A 
to .858 and .888 for pattern J. The decline in 
the correlations is rather consistent with 
reductions in the number of pairs per individ- 
ual, except that patterns I and J, which are 
based on 9 and 7 pairs per person, respectively, 
show more marked decline, and greater
differences between the two groups, than do 
the other patterns. In general, it appears 
that reductions in the number of pairs per 
individual to 21 (pattern F) or to 17 (pattern 
G) apparently can be made with only limited 


effect on the resulting ratings; these patterns 
give correlations in the neighborhood of .95 
and .96. 

Ratings from Complete Versus Partial Pair- 
ings for Groups of Thirty. The correlations 
for the patterns used with the two groups of 
30 ranged from .996 and .994 (pattern A) to 
.948 and .898 (pattern H). Reductions to 
about 12 pairs per individual (pattern K) 
appear to be feasible without affecting materi- 
ally the resulting ratings. 

Ratings Resulting from Random Halves. As 
a supplementary type of analysis, the two 
groups of 50 were split into halves by selecting 
at random the numbers of the employees to 
go into each half. This gave two halves of 
25 each for Group I and for Group II. The 
individuals within each half were then paired 
completely (i.e., each individual was paired 
with each of the other 24 individuals in the 
same half) and performance rating indexes 
were obtained. For Group I and for Group 
II independently, the performance rating in- 
dexes obtained for all individuals (those from 
both random halves) under these conditions 
were then correlated with the indexes ob- 
tained from the original complete pairing. 
The correlations obtained were .974 and .955 
for Groups I and II, respectively. These 
correlations are of essentially the same order 
as those obtained for pattern E in which 
each individual is paired with 25 others. 
This would seem to indicate that with an N
of approximately 50, a splitting of the group 
into random halves and pairing each half 
completely will give relatively the same ratings 
as when the original total group is paired 
completely. It also suggests that relatively 
the same results can be obtained when pairings 
are made within the random halves as when a 
pattern of partial pairings is used which 
provides for each individual to be paired with 
approximately half of the others in the total 
group. 


Summary and Conclusions 


Two groups of 50 industrial employees were 
rated independently by their respective fore- 
men using the method of paired comparison; 
all possible pairs of employees were rated. A 
performance rating index was obtained for 
each individual of each group using an index 







table that is provided with the Personnel 
Comparison System. 

A series of systematic patterns of partial 
pairings were developed for experimental use; 
each such pattern provided for each individual 
to be paired with a specific number of other 
individuals. Ten patterns were developed 
which provided, respectively, for each person 
to be paired with the following numbers of 
others in the group: 40, 35, 32, 28, 25, 21, 17, 
13, 9, 7. The total numbers of pairs for the 
various patterns ranged from 1,000 to 175; a 
complete pairing results in 1,225 pairs. 

Performance rating indexes were computed 
from the ratings made on the pairs included 
in each pattern. These indexes were then cor- 
related with those derived from the complete 
pairing. The range of these correlations was 
from .994 to .858. Correlations of the order 
of approximately .93, for example, were 
obtained with a pattern which reduced the 
total number of pairs from 1,225 to 325. 

Four patterns were also used with two groups 
of 30 individuals extracted randomly from 
the two original groups of 50. The ratings 


resulting from these patterns were correlated 
with the ratings resulting from a complete 


pairing of each of the 30 individuals with all 
of the others. These correlations ranged from 
.996 to .898. 

It should be kept in mind that the coeffi- 
cients of correlation between ratings based on 
partial pairings will be affected, statistically, 
by the fact that the partial pairings are also 
included in the complete set of pairings; there 
is a certain parallel in this situation with that 
in which part scores of a test are correlated with 
total scores. These correlations, therefore, 
should be interpreted as indexes of the extent 
to which various patterns of partial pairings 
can produce ratings which will reproduce the 
ratings from a complete pairing. These cor- 




relations therefore cannot be considered as 
being specifically indicative of the reliability 
of the various patterns. The reliability of 
such a pattern would be largely a function of 
the extent to which different “samplings” of
rating judgments based on that pattern would 
produce consistently the same rating results. 
The reliability of ratings based on partial 
pairings has been investigated in an associated 
study (2). 

On the basis of the results of the experiment 
the following conclusions seem warranted 
when using the paired comparison system for 
rating employees in groups of approximately 
the sizes of those investigated: 


1. Ratings obtained from partial pairings 
result in fairly high correlations with ratings 
based on complete pairings; the correlations 
are reduced rather systematically with reduc- 
tions in the numbers of pairs per individual 
on which the ratings are based. 

2. Rather substantial reductions can be 
made in the numbers of pairs per individual
with only limited reductions in the extent to 
which the resulting ratings differ from those 
obtained with complete pairings. 

3. The potential reduction in the total 
number of pairs to be rated with large groups 
can reasonably be expected to make the paired 
comparison system more practical for use in 
employee rating and for other purposes. 


Received May 28, 1951. 


References 


1. Guilford, J. P. Fundamental statistics in psychology 
and education. New York: McGraw-Hill Book 
Co., Inc., 1950. 

2. McCormick, E. J., and Roberts, W. K. Paired 
comparison ratings. II. The reliability of ratings 
based on partial pairings. J. appl. Psychol.,
1952, 36, in press. 





Dial Reading Performance as a Function of Brightness ¹


S. D. S. Spragg and M. L. Rock ²
University of Rochester 


Instrument dials must often be read rapidly 
and accurately under conditions in which it
is desirable to provide no more than the 
minimum amount of illumination necessary for 
the efficient performance of a task. Such 
conditions are found for example in the airplane 
cockpit during night flying. It has seemed 
desirable in the night operation of military 
aircraft and perhaps somewhat less for com- 
mercial aircraft to attain and preserve as much 
dark adaptation on the part of the pilot and 
co-pilot as is feasible. 

This demand has posed the persistent 
problem of the amount and nature of illumina- 
tion which will best meet the requirements of 
such a situation. A practical solution to the 
problem will obviously be a compromise, but 
it should be based on a determination of the 
effectiveness of visual performance under a 
range of intensities and spectral distributions 
of illumination. From this, one should be 
able to specify the amount and spectral distri- 
bution of illumination which will: (a) permit 
satisfactory performance of visual perceptual 
tasks inside the cockpit (reading dials, etc.); 
and (b) maintain a level of dark adaptation 
sufficient for the pilot and co-pilot to deal 
adequately with visual stimuli coming from 
outside the cockpit. 

As a beginning in a series of studies designed 
to contribute toward the solution of the problem 
experiments have been undertaken attempting 
to relate visual performance (as indicated by 
the speed and accuracy of reading dials) to the 
intensity and to the spectral distribution of the 
illumination provided. 

The present report concerns itself with dial 
reading performance as a function of illumina- 


¹ The experiments reported here were conducted as
part of a program of research on human factors related 
to aircraft instrument lighting carried out on a research 
contract (W33-038 ac18317) between the University of 
Rochester and the Air Materiel Command, U. S. Air 
Forces. They have been reported in the following tech- 
nical reports to the Aero Medical Laboratory of the Air 
Materiel Command: MCREXD-694-21 and TR 6040. 

² M. L. R. is now associated with E. N. Hay Asso-
ciates, Philadelphia. 


tion intensity. Subsequent reports will de- 
scribe comparable experiments using a range 
of colored filters to modify the spectral 
distribution of illumination as well as studies 
of the adequacy of flying a Link Trainer (in 
a task in which the cues are almost completely 
visual) as a function of the above variables. 

Although dial reading is a complex percep- 
tual task rather than a simple acuity function, 
available information on the relationship 
between acuity and illumination is relevant in 
that it may suggest the general nature of the 
function as well as set a lower limit to perform- 
ance. The early studies of König as reported
in (15, p. 86) as well as other more recent 
studies have indicated that acuity varies as 
the logarithm of illumination intensity, with 
the implication that even at high illuminations 
an increase in illumination will produce some 
increment of acuity. 

Other workers, however, have reported that 
visual acuity increases with illumination incre- 
ments only up to a relatively modest level 
(such as 5 to 10 or 20 foot-candles) and that 
the increase is hardly noticeable beyond this 
range. Carmichael and Dearborn (2) after 
reviewing the relevant acuity and reading 
studies chose an illumination intensity of 16 
foot-candles for their reading experiments, 
considering this value to represent an optimum 
level in view of the available evidence. 

A number of recent studies, both military 
and civilian, have concerned themselves with 
factors determining acuity and other character- 
istics of visual performance as a function of 
illumination level in a variety of task situa- 
tions. This literature has been surveyed, with 
differing emphases, by Fulton and his co- 
workers (5, 6, 7), Lawrence and Macmillan 
(10), Smith and Kappauf (12) as well as others. 
A resumé of that literature will not be under- 
taken here. There still remains, however, 
need for information relating visual perceptual 
tasks (such as dial reading) to a systematically 
varied range of illumination values. Such is 
the aim of the present study. 




Method 


Two experiments were performed (I and II).
Except where otherwise indicated the state- 
ments in this section apply to both experiments. 

Apparatus. The general plan of the appara- 
tus followed that employed by Kappauf and 
his co-workers (8, 9) in their studies of dial 
designs and legibility. The subject was seated 
in a three-sided booth, approximately 4 x 4 
feet, facing the middle wall. In this wall was 
an 11 X 14 inch aperture in which the sample 
dial and the cards containing banks of stimulus 
dials were presented. The center of the 
aperture (and of a bank of dials) was 28 inches 
from the subject’s eyes and 15° below his 
horizontal line of regard. The carrier for the 
dial cards was correspondingly tilted 15° so 
that the surface of the card was normal to the 
subject’s line of regard when directed at the 
center of the bank of dials. An adjustable 
head-rest, mounted on a horizontal bar,
served to keep the subject’s head in a satisfactorily
constant and comfortable position.

The carrier for the dial cards slid in hori- 
zontally placed brass tracks. It was double 


(i.e., 11 X 28 inches) so that as one card (e.g., 


a sample dial) was slid out of the subject’s view, 
another card (a bank of dials) came immed- 
iately into view. Micro-switches at each end 
of the track were arranged so that the illumi- 
nation on the dial cards went off as the carrier 
was moved from one position and came on as 
it reached the other position. In this way the 
shift from one card to another was accom- 
plished rapidly in a short interval of darkness 
and did not require the subject to make any 
major shift in visual orientation. Thus the 
subject was kept steadily at the chosen level of 
illumination throughout a series of readings, 
except for an instant of darkness between the 
presentation of each stimulus card. 

The experimenter was seated at a small 
work table placed against the outside of the 
middle wall of the booth. The card carrier 
was in front of him within easy reach and to 
his side was a bin containing the supply of 
stimulus cards. 

Pairs of Mazda lamps served as light sources. 
They were mounted on the horizontal bar 
that carried the subject’s head-rest, about 18 
inches on each side of the head-rest. For the
four lowest intensities two Air Force cockpit
lamps, type C-4, were used; for the 6 foot- 
lambert intensity 115v. 25w. Mazda lamps in 
cans were employed. The lamps were care- 
fully adjusted so that the stimulus cards were 
evenly illuminated. 

Voltage was maintained at a constant level 
by means of a Variac, Model V-5MT, and 
a monitoring Weston AC voltmeter, Model 
433, on the experimenter’s desk. The lamps 
were operated at less than rated voltage, 
41v. in the case of the cockpit lamps (wired in 
series) and at 93v. for the two 25w. Mazda 
lamps, in order to increase their stability. 
The color temperature of both was in the 
neighborhood of 2400° K. 

Chosen levels of illumination were achieved 
by means of accurately drilled apertures in 
removable brass plates. All light sources had 
two ground-glass surfaces in the optical 
pathway to achieve high dispersion. 

Materials. The stimulus materials consisted 
of high-contrast, white on black photographic 
reproductions of dial settings. Each stimulus
card showed 12 dials—three rows of four dials 
each. The dials used in Experiment I were 
2.8 inches in diameter, constructed according 
to Air Force specifications, but with a scale of 
100 units divided by numbers and scale marks 
at every 10 units. Figure 1 shows a representative
bank of dials. Sample dials were
identical to the stimulus dials except that they 
lacked a pointer. The dials for Experiment 
II were the same as those for Experiment I 
except for two changes: they were 1.4 inches 
in diameter, and had scale marks for every unit 
on the 100 unit scale. These dials, chosen 
from a wide variety developed by Dr. William 
Kappauf and his associates at Princeton 
University, were selected because they had 
been shown to constitute a relatively difficult 
perceptual task with a fairly high proportion 
of errors.* Details of the construction of the
dial cards and some experimental results have 
been reported (9). 
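
As a rough aid to comparing the two dial sizes, the visual angle subtended by a dial at the 28-inch viewing distance given under Apparatus can be estimated. The sketch below is an illustration only and is not a computation taken from the original reports; it assumes a dial viewed head-on at the center of the bank, and the function name `visual_angle_deg` is hypothetical.

```python
import math


def visual_angle_deg(size_in, distance_in):
    """Angle in degrees subtended by an object of the given size, viewed head-on."""
    return math.degrees(2 * math.atan((size_in / 2) / distance_in))


VIEWING_DISTANCE_IN = 28.0  # eye-to-card distance stated under Apparatus

for diameter_in in (2.8, 1.4):  # dial diameters used in Experiments I and II
    angle = visual_angle_deg(diameter_in, VIEWING_DISTANCE_IN)
    print(f"{diameter_in} inch dial: about {angle:.1f} degrees of visual angle")
```

Under these assumptions the larger dial subtends a little under 6 degrees and the smaller a little under 3 degrees; dials away from the center of the bank lie slightly farther from the eye and subtend slightly less.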

For Experiment I five of the stimulus cards 
(each containing 12 dials) were selected. Each 
was cut vertically into equal halves thus 
yielding ten cards, each containing three rows 

* This project is grateful to Dr. Kappauf for his
generosity in making available these stimulus materials,
and for his many valuable suggestions.











S. D. S. Spragg and M. L. Rock 


Fig. 1.


of two dials each. These were mounted on 
masonite board. For a given reading any 
two of the cards were selected and placed 
together to form the left and right halves of 
a full 12-dial stimulus card. This procedure 
made available a large number of combinations 
of half-cards, and reduced the chances for 
distortion of results due to remembering 
certain recurring combinations or sequences. 
A given stimulus combination (12 dials) thus 
appeared only once during the course of the 
experiment, even though each half-card (6 
dials) appeared 6 times, 3 on the right and 3 
on the left, during the course of the training 
and formal trials. A counter-balanced se- 
quence was also employed so that the appear- 
ance of the half-cards was distributed through- 
out the course of the readings. 
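
The number of distinct full cards that this half-card procedure makes available can be counted directly. The following is a minimal sketch, assuming only what is stated above (ten half-cards, any two different ones placed left and right); the labels `H1` to `H10` are hypothetical.

```python
from itertools import permutations

# Ten half-cards of six dials each; the labels are invented for illustration.
half_cards = [f"H{i}" for i in range(1, 11)]

# An ordered pair (left, right) of two different half-cards forms one 12-dial card.
full_cards = list(permutations(half_cards, 2))

print(len(full_cards))  # number of distinct left/right arrangements
```

Because left and right positions are distinguished, the pair (H1, H2) and the pair (H2, H1) count as different stimulus combinations.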

Data sheets were prepared in advance for 
each stimulus combination and for each ex- 
perimental sequence used. These indicated 
the correct settings, with adjoining spaces for 


recording the subject’s responses and provision 
for recording time, total errors, and other 
relevant information. 

Subjects. Twenty male subjects served in
Experiment I. All were students at the
University of Rochester (5 graduates, 15
undergraduates) and were in their late teens
or twenties in age. Subjects chosen were those
who passed a rigorous visual screening,* using
the Keystone Telebinocular. All subjects
had: normal ophthalmoscopy; 20/20 visual
acuity, monocularly and binocularly, at distance
and near, without glasses; 80% or better
stereopsis; no vertical imbalance; less than 6
prism diopters physiological exophoria; less
than 6 prism diopters exophoria at distance;
and normal color vision. Subjects were paid
for their services.

Fig. 1. A representative bank of dials, 2.8 inch diameter, 100 × 10 scale.


Procedures. Each subject in Experiment I 
was allowed to become cone dark adapted 
(approximately 10 minutes) before the illumi- 
nation was turned on. He was then shown in 
the aperture a sample dial under the illumina- 
tion to be used first. The instructions called 
his attention to the dial and its scale and continued
as follows: “When I say ‘Ready’ the
lights will go off and in a moment they will 
come on again. When they come on, you are 
to read the settings on the dials, reading from 
left to right, first the top row, then the second, 
then the third. Read the dials to the nearest 
unit, such as 61, 38, 43, etc. Read the dials as 
rapidly and as accurately as you can.” 

On each trial a “ready” signal was given, the
lights went off briefly as the card to be read was
slid into position, and then came on showing
the card of 12 dials in position.

* By one of us (M. L. R.), a graduate optometrist.

Six cards (72 dials) were read before formal 
trials were begun; this fore-test served to reduce 
practice effects during the experiment. 

On the formal trials each subject read 10 
cards of dials at each of 5 brightness levels. 
Subject’s responses were recorded as described 
earlier. Time was recorded by the experi- 
menter’s starting a Standard Electric Timer as 
the subject read the first dial and stopping it 
as he read the eleventh dial. The first and 
last dial readings in each card were eliminated 
from both the time and error data because of 
their relative unreliability. Evidence in sup- 
port of this procedure has been reported else- 
where (9, pp. 37-38). Thus the data for each 
subject consist of 100 dials read at each of five 
brightness levels. 
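
The scoring rule just described (drop the first and last dials, count readings in error among the middle ten, and divide the timed interval by ten) can be sketched as follows. This is an illustrative reading of the procedure, not the authors' own tabulation method; the function name `score_card` and the sample settings are hypothetical.

```python
def score_card(true_settings, readings, elapsed_seconds):
    """Score one 12-dial card using only the middle ten dials.

    elapsed_seconds is the timer interval from the reading of the first dial
    to the reading of the eleventh, i.e. the time spent on ten dials.
    """
    if len(true_settings) != 12 or len(readings) != 12:
        raise ValueError("a card holds exactly 12 dials")
    middle_true = true_settings[1:11]  # dials 2 through 11
    middle_read = readings[1:11]
    # Error frequency: any departure from the true setting counts once.
    errors = sum(1 for t, r in zip(middle_true, middle_read) if t != r)
    time_per_dial = elapsed_seconds / 10
    return errors, time_per_dial


# Invented example card: the misreadings of the first and last dials are ignored.
true_settings = [61, 38, 43, 7, 92, 15, 50, 74, 29, 66, 81, 12]
readings = [60, 38, 44, 7, 92, 15, 51, 74, 29, 66, 81, 13]
print(score_card(true_settings, readings, 15.2))
```

Ten such cards at one brightness level give the 100 scored dials per subject mentioned above.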

The five brightness levels used in Experi- 
ment I were chosen as a result of exploratory 
experimentation which indicated that a rather 
sharp change in the difficulty of the dial-reading
task occurs at a brightness of about 0.02
foot-lamberts.⁵ For this experiment, therefore,
two values were chosen which would bracket 
the suggested transition level, a third value was 
chosen to be slightly above cone threshold for 
the cone dark-adapted eye, a fourth at 6 foot- 
lamberts (the level which Kappauf, Smith, and 
Bray (9) used) and a fifth at an intermediate 
level. The values selected were: 0.005, 0.018, 
0.022, 0.296, and 6.0 foot-lamberts. 

Brightness measurements were made with a 
Macbeth illuminometer used in the subject's 
position, and directed against an 11 × 14 inch
sheet of unexposed but fixed photographic
paper from the same stock as that of the dial
reproductions. Thus its “whiteness” (reflectance)
was equivalent to that of the white
markings (numbers, pointer, scale markings)
of the dials used. The contrast between white
and black areas on the dials was somewhat 
greater than 10 to 1. 

A considerable number of brightness readings 
at each level was made by each of the two 
writers and the accepted value in each case was 
taken as the average of the two observers’ 
median readings. Agreement was close, being
within 5-10% for all levels.
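
The rule for the accepted brightness value (the average of the two observers' median readings) can be written out as a short sketch. The readings below are invented for illustration and are not the authors' measurements; the function name `accepted_brightness` is hypothetical.

```python
from statistics import median


def accepted_brightness(readings_a, readings_b):
    """Average of the two observers' median readings, in foot-lamberts."""
    return (median(readings_a) + median(readings_b)) / 2


# Invented illuminometer readings for two observers at one brightness level.
observer_a = [0.021, 0.023, 0.022]
observer_b = [0.020, 0.021, 0.024]
print(accepted_brightness(observer_a, observer_b))
```

Taking each observer's median first keeps a single stray reading from shifting the accepted value.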

Since five levels of illumination were used,
varied sequences of brightness levels were employed
to balance practice and fatigue effects.
The only restriction imposed was that a series
of readings at the brightest level should never
be immediately followed by a series at the
dimmest level. As a further precaution, in
changing from one brightness level to another
the subject was given from 5-10 minutes for
adaptation, with the light at the new level
illuminating the sample dial.

5 The foot-lambert is a measure of the density of light
flux (luminance) which is reflected from a diffusing
surface. The perhaps more commonly known foot-candle
is a measure of the density of light flux (illuminance)
falling upon a surface. The relationship between
B (in foot-lamberts) and E (in foot-candles) is: B = RE,
where R is a value for the reflectance of the surface in
question (e.g., 40%, 75%, etc.).
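
The relation between foot-lamberts and foot-candles given in the footnote on the foot-lambert, B = RE with R the reflectance of the surface, can be illustrated numerically. The sketch below assumes a diffusing surface, as the footnote does; the function names and the example values are hypothetical and are not measurements from the study.

```python
def luminance_ftl(illuminance_fc, reflectance):
    """B = R * E: foot-lamberts from foot-candles for a diffusing surface."""
    if not 0.0 <= reflectance <= 1.0:
        raise ValueError("reflectance must be a fraction between 0 and 1")
    return reflectance * illuminance_fc


def illuminance_fc(luminance, reflectance):
    """E = B / R: foot-candles needed to produce a given brightness."""
    if not 0.0 < reflectance <= 1.0:
        raise ValueError("reflectance must be a fraction in (0, 1]")
    return luminance / reflectance


# Invented example: 8 foot-candles falling on a surface of 75% reflectance.
print(luminance_ftl(8.0, 0.75))
```

Read the other way, the same relation indicates how much illumination a surface of known reflectance needs in order to reach a chosen brightness.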

Each subject was tested at two sessions, 
several days apart. At the first session sub- 
jects were given the visual screening tests, the 
practice trials, and tests at the first two bright- 
ness levels to be used for that subject. At the 
second session some further informal practice 
was given, then tests on the remaining three 
brightness levels. 
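
The sequencing restriction stated earlier (a series at the brightest level is never immediately followed by a series at the dimmest) can be checked mechanically. The sketch below simply enumerates every order of the five levels and filters out the forbidden ones; it treats the five series as one continuous sequence, which is an assumption, since the source does not say how the break between sessions was handled. The helper name `is_allowed` is hypothetical.

```python
from itertools import permutations

LEVELS = (0.005, 0.018, 0.022, 0.296, 6.0)  # foot-lamberts, Experiment I
BRIGHTEST, DIMMEST = max(LEVELS), min(LEVELS)


def is_allowed(order):
    """True unless the brightest level is immediately followed by the dimmest."""
    return all(
        not (first == BRIGHTEST and second == DIMMEST)
        for first, second in zip(order, order[1:])
    )


allowed_orders = [order for order in permutations(LEVELS) if is_allowed(order)]
print(len(allowed_orders), "of", len(list(permutations(LEVELS))), "orders remain")
```

The restriction removes only a minority of the possible orders, so ample room remains for balancing practice and fatigue effects across subjects.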

Subjects were given no knowledge of results, 
i.e., they were not told the correct readings nor 
whether their readings were correct or wrong. 


Results 


Experiment I. The data of this experiment 
consist of error scores and time scores made 
by the group of 20 subjects, under the 5 levels 
of illumination employed. 

The principal analysis of errors is in terms of 
error frequency, i.e., the number of readings in 
error without regard to the magnitude of error. 
Thus an error of one unit has equal status with 
an error of four or ten units in such an analysis. 
Table 1 summarizes the mean error frequencies 
for the five brightness levels used. Each mean 
is based on 100 dials read by each of 20 
subjects. 

Data on speed of dial reading consist of times 
required to read the middle 10 of each card of 
12 dials. They are summarized as mean 
reading time per dial in Table 1. Each mean
is based on 2000 readings (100 dials read by
each of 20 subjects).⁶

Table 1

Dial Reading Performance at Five Brightness Levels,
2.8 inch, 100 × 10 Dials (N = 20)

Brightness,   Mean Number (and %)    Standard    Mean Reading     Standard
in Foot-      of Readings in Error   Deviation   Time per Dial,   Deviation
Lamberts      in Reading 100 Dials               in Seconds

0.005         67.3                   10.0        2.84             0.93
0.018         59.9                   14.1        2.64             0.74
0.022         30.1                    8.1        1.52             0.21
0.296         27.8                    5.5        1.33             0.21
6.0           27.8                    4.4        1.30             0.22

Fig. 2. Frequency of errors in reading 2.8 inch, 100 × 10 dials and 1.4 inch, 100 × 1 dials as a function of brightness. (Ordinate: per cent readings in error; abscissa: log I, in foot-lamberts; separate curves for 2.8 inch and 1.4 inch dials.)

The error frequency data are summarized in 
Figure 2 (results for 2.8 inch dials) and the 
reading time data in Figure 3 (2.8 inch dials). 
These two figures are seen to be highly similar.⁷
Both indicate that in this visual task there is 
marked improvement with illumination in- 
crease up to approximately 0.02 foot-lamberts 
and relatively little improvement thereafter 
at least up to 6.0 foot-lamberts. We have 
made informal observations indicating no 
significant improvement at levels considerably 
higher than this. 
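
Figures 2 and 3 are plotted against the logarithm of brightness, and the logarithms listed in the accompanying footnote appear to be written in the older characteristic-and-mantissa form, in which a negative characteristic is paired with a positive mantissa. The sketch below reproduces those values under that assumption; the function name `bar_log10` is hypothetical.

```python
import math


def bar_log10(x):
    """Split log10(x) into an integer characteristic and a positive mantissa."""
    value = math.log10(x)
    characteristic = math.floor(value)  # negative for brightnesses below 1
    mantissa = value - characteristic   # always in the range [0, 1)
    return characteristic, round(mantissa, 3)


for brightness in (6.0, 0.296, 0.022, 0.018, 0.005):
    characteristic, mantissa = bar_log10(brightness)
    plain = characteristic + mantissa
    print(f"{brightness}: characteristic {characteristic}, "
          f"mantissa {mantissa:.3f}, ordinary log {plain:.3f}")
```

For example, 0.296 foot-lamberts has characteristic -1 and mantissa 0.471, which is the same quantity as an ordinary logarithm of about -0.529.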

Since our principal concern was with dial
reading performance as a function of brightness
level, statistical analysis consisted primarily of
t tests comparing performance between the 
several brightness levels, both for errors and 
for time. These are summarized in Table 2. 
From this table it is seen that all the differences 
and only the differences which cross the 0.02 
foot-lambert value are significant at the 1% 
level. Except for one instance no difference 
that does not cross the 0.02 foot-lambert value 


6 Detailed results for errors and times have been presented
in the original technical reports of these experiments
(13, 14).

7 In Figures 2 and 3 the data have been plotted
against the logarithm of brightness since an arithmetic
plot would involve a very lengthy scale and the slopes
of all four of the curves would appear very nearly
vertical. The logarithm values for the brightnesses
used (other than the obvious ones) are shown as follows
in the parentheses, a barred digit denoting a negative
characteristic (e.g., 1̄.471 = -1 + 0.471): 6.0 (0.778);
0.296 (1̄.471); 0.022 (2̄.342); 0.018 (2̄.255); 0.005 (3̄.699).


Table 2

Values of t, Comparing Dial Reading Performance at
Five Brightness Levels, 2.8 inch, 100 × 10 Dials

                    Brightness, in Foot-Lamberts
            0.005      0.018      0.022      0.296      6.0

t's between Error Means
0.018        2.70*     -
0.022       12.89**    14.39**    -
0.296       16.61**    13.62**    1.75       -
6.0         15.70**     8.84**    1.58

t's between Time Means
0.018        1.75      -
0.022        6.82**     6.60**    -
0.296        7.60**     8.05**    1.90       -
6.0          9.40**     9.86**    1.94       1.68

* Significant at 5% level.
** Significant at 1% level.


is significant at even the 5% level. It seems 
clear that in terms of speed as well as accuracy 
there is a highly significant improvement in 
dial reading performance when the brightness 
level is increased from values below 0.02 foot- 
lamberts to measured values above 0.02 foot- 
lamberts, and that further increases up to 
6.0 foot-lamberts bring little or no increment in 
performance. 
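
The form of t test used is not stated. Since the same 20 subjects were tested at every brightness level, one plausible reading is a t test for correlated (paired) means, and the sketch below illustrates that computation under this assumption. The data are invented, the function names are hypothetical, and the critical values are the standard two-tailed table values for 19 degrees of freedom rather than figures taken from the report.

```python
import math
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """t for the mean difference between paired scores, with its degrees of freedom."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists of at least two scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


T_CRIT_05 = 2.093  # two-tailed 5% point of Student's t, 19 degrees of freedom
T_CRIT_01 = 2.861  # two-tailed 1% point of Student's t, 19 degrees of freedom


def significance_marker(t, crit_05=T_CRIT_05, crit_01=T_CRIT_01):
    """Return '**', '*', or '' in the manner of the table footnotes."""
    t = abs(t)
    if t >= crit_01:
        return "**"
    if t >= crit_05:
        return "*"
    return ""


# Invented error scores for five subjects at a dim and a bright level.
dim = [2, 4, 6, 8, 10]
bright = [1, 2, 3, 4, 5]
print(paired_t(dim, bright))
```

With 19 degrees of freedom, a t of 2.70 falls between the two critical values and a t of 1.75 falls below both, which agrees with the single and absent asterisks attached to those values in Table 2.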

The distribution of errors with respect to 
magnitude of error is summarized in Table 3. 





Fig. 3. Mean time required to read 2.8 inch, 100 × 10 dials and 1.4 inch, 100 × 1 dials as a function of brightness. (Ordinate: mean time per dial, in seconds; abscissa: log I, in foot-lamberts; separate curves for 1.4 inch and 2.8 inch dials.)


Although at all brightness levels errors of 1 
scale unit are in the majority, it is clear that 
the distribution of errors is markedly different 
above and below 0.02 foot-lamberts. Above 
this value errors of 1 and 2 scale units account 
for 95% to 96% of all errors made. For the 
two brightness levels below 0.02 foot-lamberts, 
however, errors of 1 and 2 scale units account 
for only about 75% of the errors and errors 
of greater magnitude are much more frequent. 

The large magnitude errors (10 scale units 
and over) at the two lower brightness levels 
were mostly errors of 48 to 50 scale units. 
At these levels subjects were at times uncertain 
as to which was the pointer end and which the 
reverse end of the indicator. At the two 
lowest brightness levels about 1 reading in 18 
was a reversal (error of approximately 180°). 
At levels above 0.02 foot-lamberts this type of 
error was completely absent in 6000 dial 
readings. 
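
The grouping of errors by magnitude, and the identification of reversals, can be sketched as below. The sketch assumes that the 100-unit scale runs once around the full dial face, so that 50 units correspond to 180° and readings near 0 and near 100 are close together; this is inferred from the statement that errors of 48 to 50 units are reversals. The function names are hypothetical.

```python
def error_units(true_setting, reading, scale=100):
    """Smallest distance around the dial between the true setting and the reading."""
    difference = abs(true_setting - reading) % scale
    return min(difference, scale - difference)


def classify(true_setting, reading):
    """Return (magnitude group, is_reversal) using the groups of Table 3."""
    units = error_units(true_setting, reading)
    is_reversal = units >= 48  # wrong end of the indicator read, about 180 degrees
    if units == 0:
        group = "correct"
    elif units <= 2:
        group = str(units)
    elif units <= 9:
        group = "3 to 9"
    else:
        group = "10 and over"
    return group, is_reversal


print(classify(20, 70))  # an invented reading half a turn away from the setting
```

Reversals are therefore a subset of the "10 and over" group, which is why that group shrinks so sharply once the pointer end can be told from the reverse end.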

Analysis of possible practice effects in this 
experiment was made by pooling the data for 
first brightness level tested, second level tested, 
etc. Since each brightness level appeared in 
each ordinal position an equal number of times, 
no advantage due to sequence is present for 
any brightness level. The results, both for 
time and for error scores, show no evidence of 
a practice effect for the experiment as a whole. 
In fact there was some decrement in perform- 
ance on each of the two days as testing 
continued. This would suggest that motiva- 
tional and fatigue factors may have been more 
important here than practice effects. Subjects’
comments indicated that the degree of concentration
required made this an arduous task.

Table 3

The Proportion of Errors Occurring at Differing
Magnitudes, for Each Brightness Level

Magnitude of Error,         Brightness, in Foot-Lamberts
in Scale Units          0.005    0.018    0.022    0.296    6.0

1                        54%      58%      85%      90%     87%
2                        18       19       10        6
3 to 9                   19       15        3        1
10 and over               9        8        2        3

Total per cent          100      100
Total number of
readings in error      1346     1198

Table 4

Dial Reading Performance at Five Brightness Levels,
1.4 inch, 100 × 1 Dials (N = 10)

Brightness,   Mean Number of        Standard    Mean Reading     Standard
in Foot-      Readings in Error     Deviation   Time per Dial,   Deviation
Lamberts      in Reading 50 Dials               in Seconds

0.005         31.3                  8.0         3.45             1.33
0.01          20.8                  7.8         2.79             0.66
0.05           5.7                  4.0         1.7              0.21
0.1            3.9                  2.8         1.7              0.24
1.0            3.2                  2.1         1.5              0.21

Experiment II. The findings reported above 
were based on fairly large dials with widely- 
spaced scale divisions. In order to test the 
generality of these findings a second experiment 
was run using smaller dials (1.4 inch diameter) 
and finer scale division spacings (a scale mark 
for each unit of the 100 unit scale). With 
these exceptions the stimulus materials, ap- 
paratus, and general procedures were the same 
as for Experiment I. Five brightness levels 
were chosen: 0.005, 0.01, 0.05, 0.1, and 1.0 
foot-lamberts sampling in a somewhat different 
manner the approximate brightness range 
employed in Experiment I. 

Subjects were ten male students selected in 
the same manner as for Experiment I. Each 
subject was given five cards (of 12 dials each) 
to read as preliminary practice. On the 
formal trials each subject read five cards at 
each of the five brightness levels tested. 
Since the data are based on the middle 10 dials 
of each card the results consist of 50 dial 
readings at each brightness level for each of 
ten subjects. 

Table 4 summarizes the mean error fre- 
quencies and the mean reading time per dial 
for the five brightness levels. The error 
frequency data are also presented graphically 
in Figure 2 (results for 1.4 inch dials) and the 
mean reading time data in Figure 3 (1.4 inch 
dials). 

A ¢ test analysis comparing performance 
between the various brightness levels, both for 
time and for error frequency means, is sum- 
marized in Table 5. 










Table 5

Values of t, Comparing Dial Reading Performance at
Five Brightness Levels, 1.4 inch, 100 × 1 Dials

                  Brightness, in Foot-Lamberts
           0.005         0.01          0.05       0.1

t's between Error Means
0.01       [illegible]   -
0.05       11.48**       6.43**        -
0.1        10.66**       7.82**        1.30       -
1.0         9.53**       [illegible]   1.43

t's between Time Means
0.005      -
0.01       1.67          -
0.05       3.65**        5.85**        -
0.1        3.67**        6.25**        2.46*      -
1.0        4.15**        [illegible]   9.08**     4.28**

* Significant at the 5% level.
** Significant at the 1% level.





The results presented indicate that as bright- 
ness level is increased dial reading accuracy and 
speed improve markedly up to 0.05 foot- 
lamberts, but above this level increments of 
improvement are much less. Analysis of the 
data shows that for errors there is no signif- 
icant improvement above 0.05 foot-lamberts. 
For time, however, there is significant improve- 
ment throughout the brightness range tested 
even though the absolute change is much less 
above 0.05 foot-lamberts. 

These results and inspection of the two sets 
of curves in Figures 2 and 3 indicate that the 
findings of both experiments are in essential 
agreement. Both the time curves and the 
error curves indicate that dial reading perform- 
ance becomes markedly poorer as brightness 
falls below the level of approximately 0.02 
foot-lamberts. At 0.005 foot-lamberts (which 
is slightly above cone threshold) it takes 
roughly three to three and one-half seconds to 
read a dial, and approximately two-thirds of 
the readings are in error. 
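
Because the two experiments scored different numbers of dials per subject (100 in Experiment I, 50 in Experiment II), the mean error counts have to be put on a common percentage basis before they are compared. A minimal sketch, using the mean error counts reported for 0.005 foot-lamberts in Tables 1 and 4; the function name `percent_in_error` is hypothetical.

```python
def percent_in_error(mean_errors, dials_read):
    """Convert a mean count of readings in error to a percentage of dials read."""
    return 100.0 * mean_errors / dials_read


# Mean readings in error at 0.005 foot-lamberts, as reported in the two tables.
experiment_1 = percent_in_error(67.3, 100)  # 2.8 inch dials, 100 dials per subject
experiment_2 = percent_in_error(31.3, 50)   # 1.4 inch dials, 50 dials per subject
print(round(experiment_1, 1), round(experiment_2, 1))
```

Both percentages lie near two-thirds, which is the basis for the statement above.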

The fact that the performance curves suggest 
something approaching a plateau or, at least, 
a relatively gentle slope between the two 
lowest brightness levels for the 2.8 inch dials, 
whereas the curves for the 1.4 inch dials rise 
rapidly and steadily below 0.05 foot-lamberts, 


is believed not to be a serious discrepancy. 
The hint of a plateau or moderation of slope in 
the 2.8 inch dial data may have been affected 
by the choice of brightness levels, by some 
aspect of the 2.8 inch dials (such as absence 
of fine scale marks), or may have been a result 
of some aspect of the sampling. On the basis 
of Rock’s findings with a series of four widely 
differing visual tasks (11), and in view of the 
common-sense consideration that if brightness 
is pushed down to cone threshold and below the 
time and error scores in dial reading will 
certainly rise to very high levels, it would seem 
reasonable to hypothesize that the performance 
curves for the 1.4 inch dials more closely 
represent the situation in the region from cone 
threshold up to 0.05 foot-lamberts. 

For both types of dial evidence for a plateau 
or near plateau at values above the 0.02 
foot-lambert region is considerable. In both 
experiments performance increments become 
slight at brightnesses above this region. For 
the higher values the time required to read 
each dial is of the order of 1.3 to 1.5 seconds. 

Some discrepancy is seen between the error 
scores for the two sizes of dial at the three 
highest brightness levels. For the 1.4 inch, 
finely spaced dials the proportion of the read- 
ings in error is 10 per cent or less. For the 
2.8 inch, coarsely graduated dials the propor- 
tion is somewhat greater than 25 per cent. 
It is believed that this difference can be 
explained by the differences in scale markings 
between the two dials. Given sufficient 
brightness the 1.4 inch, 100 X 1 dials can be 
read with considerable accuracy because no 
interpolation is required. The 2.8 inch dials, 
however, continue to require interpolation 
judgments at these levels of brightness as well 
as at the lower levels. At the lowest bright- 
ness level the small dials also probably require 
a good deal of interpolation judgment because 
the minor scale marks have become difficult or 
impossible to see. Hence the results for the 
two kinds of dial agree closely at these levels. 


Discussion and Conclusions 


The results of the time scores and error 
scores reported above indicate clearly that 
there is a critical level of brightness (about 
0.02 foot-lamberts) below which subjects find 
it difficult to perform the dial reading task, as 





Dial Reading Performance as Function of Brightness 


shown by relatively slow responses and greater 
frequency and magnitude of errors. Above 
this level the task becomes suddenly much
easier: responses are quicker, and the frequency
and magnitude of errors are much less. Further
increases in brightness, however (at least up 
to 6.0 foot-lamberts and very probably in- 
definitely), produce no further increments of 
performance. It seems as though once a 
subject has been given enough brightness to 
perform this task with ease, brightness is no 
longer a significant variable. 

These findings have recently been corrob- 
orated by two studies from other laboratories, 
reported after the present experiments were 
completed, and from a later series of experi- 
ments by Rock (11) in this laboratory. From 
the Tufts laboratory Crook and his co-workers 
(4) have reported results on a task involving 
the reading of numerals with brightnesses 
ranging from 15 to 0.01 foot-lamberts. Their 
findings for 10 point type size are in very close 
agreement, both for time and for error curves, 
with the comparable experiment reported by 
Rock (11) and are in general agreement with 
our dial studies. Their results show that the 
region of 0.02 to 0.04 foot-lamberts is critical 
for their visual perceptual task, performance 
dropping off sharply below this level but 
showing little or no increment above this 
level. 

At the Princeton laboratory Kappauf and 
his colleagues (3) have carried out dial reading 
studies on a variety of dial types, using bright- 
nesses ranging from 2.7 to 0.0009 foot-lamberts. 
Although there is much variation in the values 
for percentage of readings in error in their 
experiment the general shape of their curves 
for 1.4 inch diameter dials (both for errors and 
for time) agrees very closely with the results 
for 1.4 inch dials reported here (Experiment
II) in that they indicate a critical brightness
level slightly above 0.01 foot-lamberts.

Their over-all results for 2.8 inch dials 
would suggest a critical brightness level at 
about 0.007 foot-lamberts, a value somewhat 
lower than that found in our Experiment I 
above. If, however, we plot their results for 
the 2.8 inch 100 X 10 dials (duplicates of which 
we used in our Experiment I) we find that, 
allowing for some irregularities in their obtained
percentages of readings in error, a
smoothing of their 100 × 10 results would
locate the critical brightness level at about 0.01
foot-lamberts. This should be regarded as 
good agreement with our present results. 

These several findings are in interesting 
contrast with König’s classical curve relating
acuity to brightness, and to the findings of 
Hecht and certain other recent investigators 
that acuity continues to increase with increases 
in brightness, even at very high brightness 
levels. Other workers whose data indicate that 
acuity ceases to increase beyond a certain 
brightness level have usually reported that 
their curves do not flatten out until an illum- 
ination of about 5 to 10 foot-candles has been 
reached.

There is no fundamental discrepancy be- 
tween such findings and the present results. 
Acuity studies deal with threshold phenomena 
and relatively simple stimulus materials while 
our data are from a relatively complex visual 
task in which the digits and the pointer are 
well above threshold size. Performance is 
thus a function not so much of acuity as of 
speed and accuracy in making a visual judg- 
ment which often requires an interpolation. 
Hence the lack of close correspondence between 
our results and the earlier acuity studies should 
occasion no dismay. 

In our dial reading task one important 
variable is the effective or subjective contrast 
between white figure and dark background. 
It is true of course that contrast, defined 
physically, is independent of illumination. At 
our low brightness levels, however, there are
obviously fewer j.n.d.’s of brightness between 
figure and ground than there are at brightness 
values which are great enough so that the 
psychophysical function is approximately con- 
stant. Some approximate calculations based 
on Blackwell’s contrast threshold data (1), 
and assuming a brightness ratio of 10 to 1 
between our stimulus figures and their back- 
ground, indicate that for a background bright- 
ness of 0.001 foot-lamberts a stimulus figure 
would have to have a brightness very close 
to 0.01 foot-lamberts to be at liminal contrast. 
This means that for a figure brightness of 0.02 
foot-lamberts the figure-ground contrast is 
approximately 2 j.n.d.’s. By comparison, at
a figure brightness of 0.1 foot-lamberts the
figure-ground contrast would be roughly 7 
j.n.d.’s. Thus our critical 0.02 foot-lambert 
level represents a value which provides barely 
adequate subjective contrast; for values below 
this level performance suffers decrement due 
to insufficient subjective contrast (expressed 
in j.n.d.’s) while for values above this level 
contrast is sufficiently great that it no longer is 
a significant variable for the task in question. 

From a practical standpoint the results of 
the present study indicate that in visual 
perceptual tasks of this nature where maximum 
performance is desired with a minimum of 
brightness (in order, for example, to conserve 
dark adaptation) care should be taken to keep 
the brightness level safely above this critical 
region (0.02 to 0.05 foot-lamberts).

These findings have implications for the 
night operation of civilian and military equip- 
ment such as aircraft, and in general for the 
viewing of complex visual stimuli at low levels 
of illumination. They indicate that if the 
visual material to be dealt with has a brightness 
safely above the critical value then visual 
perception will be as rapid and as accurate as 
it would be at higher brightness levels (at least
up to 6 foot-lamberts, and possibly indefinitely).

Two limiting conditions should be kept in 
mind in connection with generalizations and
applications of the present findings. The first
is that these data have been gathered on 
photographic reproductions of dials rather 
than actual dials. Thus a parallax error due 
to angle of viewing the dial is not possible. 
Contrast may not be quite as high as on 
instrument dials and reflections from a glass 
face are lacking. In spite of these differences 
it is believed that the function measured is of 
such fundamental validity that it will apply 
to many other dial reading and similar tasks. 

A second limitation to the present findings 
inheres in the fact that the data were taken 
under conditions in which fatigue effects were 
probably not a significant variable. It may 
be that for a long-continued task of this 
general nature a minimum brightness value 
should be recommended which would be higher 
than that suggested by the present experiment. 
Further research is needed to supply informa- 
tion here. 




Summary 


Experiments are reported on the speed and 
accuracy with which subjects can read photo- 
graphic reproductions of instrument dials as a 
function of the brightness of the dial markings. 

Young adult males, rigorously screened so 
that they constituted groups with excellent 
visual abilities, served as subjects in dial 
reading tasks. A brightness range of 0.005 to 
6.0 foot-lamberts was used. Both for time 
and for error frequency scores a critical 
brightness level was found at approximately 
0.02 foot-lamberts. At brightnesses below 
this level performance was increasingly im- 
paired; above this level increases in brightness 
produced little or no improvement in visual 
performance. 

These findings suggest that for the night- 
time operation of equipment where dial-reading 
and comparable visual tasks are involved 
brightness values should be kept safely above 
the critical 0.02 foot-lambert level. As long 
as this is done visual performance will be as 
rapid and as accurate as at higher levels (i.e., 
brightness ceases to be a significant variable). 


Received January 24, 1952. 
Early publication. 


References 


. Blackwell, H.R. Contrast thresholds of the human 
eye. J. opt. Soc. Amer., 1946, 36, 624-643. 

. Carmichael, L., and Dearborn, W. F. Reading and 
visual fatigue. New York: Houghton Mifflin, 
1947. 

. Chalmers, E. L., Goldstein, M., and Kappauf, W. E. 
The effect of illumination on dial reading. USAF 
Technical Report No. 6021, Air Materiel Com- 
mand. August, 1950. 25 p. 

. Crook, M. N., Harker, G. S., Hoffman, A. C., and 
Kennedy, J. L. Effect of amplitude of apparent 
vibration, brightness, and type size on numeral 
reading. USAF Technical Report No. 6246, Air 
Materiel Command. September, 1950. 54 p. 

. Fulton, J. F., Hoff, P. M., and Perkins, H. T. 
A bibliography of visual literature, 1939-1944. 
Menasha, Wis.: Geo. Banta, 1945. 

. Fulton, J. F., Marquis, D. G., Perkins, H. T., and 
Hoff, P. M. A bibliography of visual literature, 
1939-1944. Supplement. Menasha, Wis.: Geo. 
Banta, 1945. 

. Hoff, E. C., and Fulton, J. F. A bibliography of 
aviation medicine. Baltimore: C. C. Thomas, 
1942. 


. Kappauf, W. E., and Smith, W. M. Design of 





Dial Reading Performance as Function of Brightness 137 


instrument dials for maximum legibility. II. A 
preliminary ex periment on dial size and graduation. 
USAF Memorandum Report No. MCREXD- 
694-1N, Air Materiel Command. July, 1948. 
16 p. 

9. Kappauf, W. E., Smith, W. M., and Bray, C. W. 
Design of instrument dials for maximum legibility. 
I. Development of methodology and some prelimi- 
nary results. USAF Memorandum Report No. 
TSEAA-694-1L, Air Materiel Command. Oc- 
tober, 1947. 42 p. 

10. Lawrence, M., and Macmillan, J. W. Annotated 
bibliography on human factors in engineering de- 
sign. Aviat. Br., Res. Div., BuMed., U. S. 
Navy, 1946. 

11. Rock, M. L. Visual performance as a function of 
low photopic brightness levels. USAF Technical 
Report No. 6013, Air Materiel Command. No- 
vember, 1950. 31 p. 


12. Smith, W. M., and Kappauf, W. E. Studies per- 
taining to the design and use of visual displays for 
aircraft instruments, computers, maps, charts, and 
tables: a bibliography. USAF Memorandum Re- 
port No. TSEAA-694-1G, Air Materiel Com- 
mand. May, 1947. 25 p. 

13. Spragg, S. D. S., and Rock, M. L. Dial reading
performance as related to illumination variables.
I. Intensity. USAF Memorandum Report No.
MCREXD-694-21, Air Materiel Command.
October, 1948. 29 p.

14. Spragg, S. D. S., and Rock, M. L. Dial reading
performance as related to illumination variables.
III. Results with small dials. USAF Technical
Report No. 6040, Air Materiel Command.
November, 1950. 8 p.

15. Troland, L. T. The principles of psychophysiology.
Vol. II. Sensation. New York: D. Van Nostrand,
1930.










Critique of Rock’s “A Sales Situation Test”


Jack Bernard * 


The Du Bois Company, Cincinnati, Ohio 


Motivations for the publication of articles in 
scientific journals are unquestionably varied. 
One may publish out of desire to spread 
scientific findings, out of desire for self-adver- 
tisement, out of desire for advertising his 
wares (as in the case of new apparatus), or for 
any combination of these or other motives. 
Where quality and scientific accuracy are 
maintained at a high level resulting in the 
presentation of something which is of value, 
questions of motivation are academic. How- 
ever, where scientific accuracy is remarkable 
largely through its absence and exhortative 
conclusions are drawn which are founded on 
the quicksands of inadequate statistics, the 
question may be legitimately asked: Is it 
science or salesmanship? Rock’s article on his 
“Sales Situations Test” is a case in point.¹

Rock informs us that the “sales situations” 
(items) in his test came from sales managers 
who were asked to make up situations calling 
for sales judgment. These were then edited.
Whether coincidence crept in during the 
writing or during the editing is an open ques- 
tion, but 13 out of Rock’s 25 questions (more 
than 50 per cent of the test) wind up as para-
phrases of questions in a “sales sense” test
developed a number of years earlier by Can-
field.² For example:


Rock, item 1: 

A salesman has difficulty in getting in to see 
his prospect. The executive's secretary refuses 
to admit him to see her employer. The sales- 
man’s procedure in this situation should be: 

Put his proposition in writing for the 
prospect. 

Interview a minor official and through 
him reach the prospect. 

Sell the secretary, hoping she will sell 
the boss. 

Obtain the secretary's cooperation 
through favors. 


* Formerly Chief Psychologist with The Klein Insti- 
tute for Aptitude Testing, Inc. 

¹ Rock, M. L. A sales situation test. J. appl. Psychol.,
1951, 35, 331-332.

² Canfield, B. R. How perfect is your “Sales Sense”?
New York: The Klein Institute for Aptitude Testing,
Inc., 1945.


Canfield, item 46: 

A salesman calling on a business executive 
experienced difficulty in securing an interview. 
The prospect's secretary refused to admit the 
salesman to her employer. The salesman’s 
procedure in this situation should have followed 
which one of the following courses: 


(1) Interview a minor official and through 
him reach his superior. 

(2) Put his proposition in writing for the 
prospect. 

(3) Sell the secretary, hoping that she will 
sell the employer. 

(4) Obtain the cooperation of the secretary 
in getting an interview. 


Nor is this exceptional. The following are
equally parallel:


Items
Rock:      2  4  6  7  8  9 11 13 15 19 21 22
Canfield: 21 40 45 33 37 38 34 17 22 23 13 18


When simple failure to give due credit is 
regarded as a breach of ethics, what must we 
consider a claim of originality for paraphrased 
material? 

As for the attempt to justify statistically the 
“close” (this sales terminology seems to apply 
better than “Summary”), it would be only
Christian charity to trust that “he knew not 
what he did.” 

Briefly: No reliability data are presented 
when: (a) low reliability is typical of the few 
tests available in this field; and (b) low reliabil- 
ity is the rule on questionnaire-type tests 
containing so few (25) items. 
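
The dependence of reliability on test length that this objection rests on is conventionally expressed by the Spearman-Brown formula. A minimal Python sketch follows. The reliability of 0.60 for a 25-item test is a hypothetical figure chosen only for illustration (no reliability data are reported for either test), and the function name is likewise an assumption.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`.

    Standard Spearman-Brown formula: r_k = k * r / (1 + (k - 1) * r).
    """
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical reliability of a 25-item test (illustrative, not from either article).
r_25_items = 0.60

# Predicted reliability if the test were doubled to 50 comparable items.
r_50_items = spearman_brown(r_25_items, 2.0)

# Predicted reliability if the test were cut to half its length.
r_half_length = spearman_brown(r_25_items, 0.5)

print(f"25 items:    {r_25_items:.2f}")
print(f"50 items:    {r_50_items:.2f}")
print(f"half length: {r_half_length:.2f}")
```

Under these assumed figures, doubling the test raises the predicted reliability and halving it lowers the predicted reliability, which is the sense in which a short questionnaire is at a disadvantage.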

Statistics were computed and conclusions 
drawn based on three samples of 25, 26, and 31 
persons each. Why the rush to publish with 
notoriously unreliable small sample analyses 
when considerably larger populations were 
available? The notation, “Early publication” 
appended to the article might well read, 
“Premature publication.”

As an offshoot of the above come questions 
as to the representativeness of the samples 
chosen. Are the “production supervisors” 
representative of “nonsalesmen”? This is
very doubtful. Are the “consumer salesmen”
and “industrial salesmen” groups typical of 
those categories? The present writer, having 
seen thousands of Canfield tests taken by 
salesmen, would expect the reverse of the 
difference reported by Rock. 

Where does the blame lie? With the young 
author who might understandably place the 
prestige of authorship above scientific caution 
and accuracy? With the harried editorial 
staff? And more important still, what can be
done to protect readers of scientific journals
from the fallacy of “it is printed and therefore
it is fact”?

In the opinion of the present writer, the 
principal contribution made by Rock in 
publishing his “A Sales Situation Test” is to 
call our attention once more to a crying need 
for a more rigorous screening of articles 
submitted to the psychological journals. 


Received January 5, 1952. 
Published out of turn by the editor. 


Answer to Bernard’s Critique of Rock’s “A Sales Situation Test”


Milton L. Rock 
Edward N. Hay & Associates, Inc., Philadelphia, Pa.


Professional people are motivated to publish 
articles chiefly to make available to other 
workers methods used and results secured on a 
problem. This is important so that men busy 
working in the field will have a body of data 
available to guide them in research problems 
in order that they do not re-do work that has 
already been accomplished. 

Gathering items for a test from people 
working in the field is common sense but it is 
now apparent that along with this, there are 
also some possible disadvantages. It is possible 
that some of the items secured in this manner 
may by accident duplicate items in other tests. 
This can only be avoided—in order to give 
due credit to the originator of the items—if 
their tests are known and can be studied. 
Canfield’s How Perfect is Your Sales Sense 
test is a test that is difficult to obtain. In 
going through the literature from 1945 to the 
present—Psychological Abstracts as well as 
commercial literature—an article by Fleming 
and Fleming¹ has come to light, in which they
mention the Canfield test. But there is 
no description, no data and no reference to 
the test in Fleming’s bibliography. Until 
Bernard’s criticism, I had no idea that a 
Canfield test existed. 

As described in my report of October, 1951,
the method of constructing this test was to
obtain sales situation descriptions from my
own experience and that of client organizations.


¹ Fleming, E. G., and Fleming, C. W. Qualitative
approach to the problem of improving selection of sales-
men by psychological tests. J. Psychol., 1946, 21, 127-150.


For the record, Company No. 1 in my 
report was Scott Paper Company, Chester, Pa. 
Company No. 2 was a large and prominent 
company in the Middle West (name on 
request). Credit for help should also be 
given to National Drying Machinery Com- 
pany, Philadelphia, Pa., and U. S. Fidelity & 
Guaranty Company, Baltimore, Md., and 
other companies. 

The only possibility that any questions were 
based on items in the Canfield test is that one 
of these companies had used the Canfield test 
and constructed new items based on items in 
that test. This possibility has never before 
arisen and there has been no time to communi- 
cate with these sources and find out. If this 
happened, it is unfortunate. However, the 
situations which arise in salesmen’s calls are 
somewhat universal. If it were otherwise, there
would be small point in including them in a 
test of sales knowledge. The constant re- 
appearance in mental tests of the same situa- 
tions phrased differently is well known. 

Concerning the population, the article was 
intended to be an introductory article on the 
problem of salesmen selection and definitely 
stated the size of the samples and used small 
sample statistics. The production supervisor 
group was one of four of the same size tested 
over the past year and a half and the results 
on the battery of tests given to the other three
groups, as described in the article, indicate
that this group is representative of the produc- 
tion supervisors of this company. The sales- 
men populations were as follows: We received 
31 out of a total of 34 from Company No. 2 
and in the Scott Paper Company, 25 were 
distributed geographically and we received 
them all. 

At this time, as mentioned in the article, we 
are trying to broaden the sample to include a 
variety of vocational situations in order that 
we may use a follow-up method to see its 
predictive value in the selection of salesmen. 
It may be interesting to note that another 
prominent company tested 17 industrial sales-
men. The results showed a range from 17 to
32 with a mean of 23.6 and a sigma of 4.4.
This shows no significant difference from the 
technical salesmen as tested in Company No. 2. 
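
As a rough indication of how far a mean based on 17 cases can be trusted, the reported sigma can be converted to a standard error of the mean. The sketch below assumes that the reported sigma of 4.4 is the sample standard deviation and that the salesmen can be treated as a simple random sample. Neither assumption is stated in the text, and the variable names are illustrative only.

```python
import math

n = 17        # industrial salesmen tested
mean = 23.6   # reported mean score
sigma = 4.4   # reported sigma, assumed here to be the sample standard deviation

# Standard error of the mean: sigma / sqrt(n).
standard_error = sigma / math.sqrt(n)

# Two-sided 95 per cent critical value of Student's t for n - 1 = 16 degrees of freedom.
t_critical = 2.120

lower = mean - t_critical * standard_error
upper = mean + t_critical * standard_error

print(f"standard error = {standard_error:.2f}")
print(f"approximate 95 per cent interval = {lower:.1f} to {upper:.1f}")
```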

Messrs. Bernard and Canfield may be 
assured that if their test had been available, 
and if some of these questions were similar 
to mine, as they say, they would have received 
due credit. In all probability, if their test had 
been available, we would have used it. But 
the disturbing part is that their test should 
have been used for six years, thousands of 
them having been administered, and yet the 
test does not appear in the literature in such 
a manner that it can be used by others. 


Received January 24, 1952. 
Published out of turn by the editor. 


Editor’s Reply to Bernard’s Criticism 


Donald G. Paterson 


University of Minnesota 


Bernard’s criticism of Rock’s article included 
an attack on the editor for accepting Rock’s 
paper. Perhaps a statement of editorial 
policy is indicated. 

Standards of editorial judgment admittedly 
vary depending upon the subject matter. A 
paper factor analyzing a set of selection tests 
must meet a far stricter standard of judgment 
than one that breaks new ground in a field of 
applied psychology. In other words, if a 
paper opens up new territory for exploration 
the editor is inclined to accept it, not because 
of its technical excellence, but because it is 
likely to lead to new research in a new field. 
Such a paper is sometimes accepted in spite of 
shortcomings. 

The field of selecting salesmen is a difficult 
and challenging one. In addition to published 
data on weighted application blanks, patterned 
interviews, intelligence tests, and measured 
vocational interests, several tests have been 
developed that purport to measure “sales
sense,” “sales judgment,” or “sales aptitude.”
At least one of these was developed and 
marketed as early as 1936. But authors and 
distributors of such tests seem to avoid 
describing them in the scientific literature. 


Buros’ Third Mental Measurements Yearbook 
(1949) contains a review of only one such test 
and the facts disclosed about that test are 
most disheartening (see p. 704). For this 
reason, Rock’s article was accepted with 
alacrity. Thus, it becomes the first test of 
this type to be put on top of the scientific 
table for everyone to scrutinize. It is to be 
hoped that this test, like its competitors, will 
now be subjected to independent cross-valida- 
tion. 

Publication of articles does not imply 
editorial endorsement. Neither does the pub- 
lication of Bernard’s criticism imply editorial 
endorsement of his rather sharply worded 
attack. Furthermore, the editor assumes that 
the readers of the Journal of Applied Psychology 
are sufficiently mature and sufficiently com- 
petent not to be easy victims of the fallacy of 
“it is printed and therefore it is fact.” Finally, 
the present editor deliberately avoids any 
semblance of censorship because the idea of 
“thought control” in science is, to him, as 
repugnant as “thought control” is in a police
state. 


Received February 18, 1952. 
Published out of turn by the editor. 





Special Review 


Eells, Kenneth, Davis, Allison, Havighurst, 
Robert J., Herrick, Vergil E., and Tyler, 
Ralph. Intelligence and cultural differences.
Chicago: University of Chicago Press, 1951. 
Pp. xii plus 388. $5.00. 

This volume is presented as “the first part 
of an extended study of cultural learning as it 
bears upon the solution of problems in mental 
tests.” It is a phase of the research program 
of the Committee on Human Development of 
the University of Chicago. Part III, “A 
Report of the Field Study,” is drawn from 
Eells’s doctoral dissertation. Part II, also 
prepared by Eells, is a summary and discussion 
of his dissertation findings. Part I includes 
five review or discussional chapters, three of 
which are revisions of earlier journal publica- 
tions, by the remaining co-authors listed above. 

The study originates in the preoccupation 
of the Chicago group with phenomena of 
stratification and the social class structure of 
American society. Social class or status is 
perceived as a crucial determinant of personal- 
ity and behavior in various life spheres, and 
as the inhibiting or facilitating force in the 
child’s development to adulthood. In the 
study here reviewed, status, from which flow 
cultural differences in experience and exposure, 
is considered as determinant of responses to 
the specific items of intelligence tests. Because 
such tests are widely used in the managerial 
aspects of our society, the authors are con- 
cerned lest possible cultural bias give spuriously 
low scores to children from low status levels. 
They grant the replicated evidence of correla- 
tion between test scores and socio-economic 
measures, but propose to question again the 
meaning of these differences: are they geneti- 
cally determined; are they environmentally 
determined; or are they the result of cultural 
bias in the content of specific test items? 

At a time when fullest utilization of our 
human resources is more pressing than ever 
from the standpoint of national survival, the 
legitimacy and pertinence of such inquiry are 
unquestioned. But the inquiry itself must be 
both dispassionate and objective. These qual-
ities are somewhat lacking in the present
volume. Havighurst and Davis, contributing
three chapters to Part I, transmute assump- 
tions and hypotheses into foregone conclusions. 
Part I seems to this reviewer to contain
particularly flagrant examples of special plead- 
ing, particularly in view of the inconclusive 
findings of the research study itself. Davis 
and Havighurst apparently assume: (1) that 
all methodological questions in the study of 
stratification have been solved; (2) that family 
status is the prepotent determinant of individ- 
ual behavior; and (3) that Eells’s results bear 
out their foregone conclusions, which is 
certainly not the case. 

Turning now to the research section of the 
volume (Part III), Eells undertakes to analyze 
the responses of a large number of pupils to 
items drawn from ten tests or subtests of nine 
widely-used intelligence tests. The testing 
was done in the schools in and around Rock-
ford, Illinois; approximately 5,000 pupils were
included, almost equally divided between nine- 
and ten-year olds and thirteen- and fourteen- 
year olds. His basic sample represents well 
over 90 per cent of the population of children 
of these ages, but the analyses essential to his 
hypotheses involve only those pupils with 
parents at the clear extremes of his status scale: 
the younger group contains approximately 
225 high-status and 325 low-status pupils; the 
older group has approximately 235 and 358 
pupils in the respective status groups. Low- 
status “ethnic” groups drop out of the analysis 
early, since they prove to be similar to low-
status “Old Americans.”

Rockford was chosen as the experimental 
area after applying a set of pragmatic criteria 
which permit no conclusions about its represen- 
tativeness in a strict sampling sense. Sim- 
ilarly, in choosing the two age groups, practical 
factors again appeared to dictate the choice. 
These age groups were designed to show “any 
changes in status differentials” attributable to 
age, but no hypotheses regarding the develop- 
mental time or nature of such possible changes 
are set forth. 

Status measurements are based on a modi-
fication of the schema set forth in Warner,
et al., Social Class in America: data on father’s
occupation, parental education, house-type, 
and dwelling area were obtained from a 
parents’ questionnaire. These items, rated 
and equally weighted, yielded the Index of 
Status Characteristics upon which all the 
families of each age group were separately 
distributed. These distributions were then 
cut “so that the high- and low-status ranges 
would be as nearly equivalent as possible to 
upper-middle class and lower-lower class
groups.” It should be apparent that the 
entire sampling process is designed to maximize 
the chances of proving the hypothesis of 
cultural bias in test items. No middle group 
is used in the item analyses as a check on the 
assumption of cultural bias; no attention is 
paid to possible sex differences in item re- 
sponses of the two age groups; no attention is 
paid to the factorial composition of the two 
sets of tests to determine whether or not 
cross-sectional measures of the same ability 
domains have been employed. 

Two chapters are given over to the correla- 
tional and group difference analyses of total 
test scores in relation to ISC. The findings 
are as might be expected: test scores and ISC 
show significant correlations in the range .20 to
.43; extreme groups on ISC scores show IQ
differences of about eight to twenty points, 
depending on the test used; some curvilinearity 
exists for some of the tests in relation to ISC. 
It is important to note in this connection that 
the authors miss one point: if class position is 
the major factor in producing the correlation 
between intelligence and socio-economic data, 
the resultant correlations should be much 
higher than those ordinarily obtained. In 
actuality, socio-economic factors account for 
so little of the variance that other factors must 
be operative in producing intelligence test 
performance. 
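
The point about variance can be made concrete by squaring the reported correlations, since the square of a correlation coefficient gives the proportion of variance the two measures share. The short Python sketch below does this for the two ends of the reported range. The function name is an illustrative assumption, and the calculation treats the relation as linear, which the review notes is not strictly true for every test.

```python
def variance_shared(r: float) -> float:
    """Proportion of variance two measures share, given their correlation r."""
    return r ** 2


# Ends of the range of correlations between test scores and ISC.
for r in (0.20, 0.43):
    shared = variance_shared(r)
    remainder = 1.0 - shared  # proportion left to other factors
    print(f"r = {r:.2f}: shared variance = {shared:.3f}, remainder = {remainder:.3f}")
```

Even at the upper end of the range, less than one-fifth of the variance is shared, which is the basis for the statement that other factors must be operative.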

The last eight of the twenty-three chapters 
contain the evidence crucial to the basic 
hypotheses: item analyses contrasting high- 
and low-status responses to items reached and 
attempted by 95 per cent of the group. As 
Eells himself states with commendable restraint 
in Chapter XVI, “the findings will not be con-
clusive.” One must first understand clearly
how many test items are actually under scru-
tiny. The original tests contained 967 possible
items; approximately one-third of them are
eliminated from the analysis because they are 
not reached and attempted by 95 per cent of 
the pupils. “Unstable items” (too hard or too
easy) are also eliminated. There are 334 
items studied in the younger age group; of these 
53 per cent are significant at the one per cent 
level, 10 per cent at the five per cent level, and 
37 per cent do not reach the five per cent level. 
In the older group, out of 324 items, eighty- 
eight per cent are significant at the one per cent 
level, three per cent at the five per cent level, 
and nine per cent do not reach the five per cent 
level. 
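
For readers who prefer counts to percentages, the figures above can be converted to approximate numbers of items. The Python sketch below is only an arithmetic aid: the percentages are rounded in the source, so the counts are approximate, and the function and label names are illustrative assumptions.

```python
def approximate_counts(total_items: int, percentages: dict) -> dict:
    """Convert rounded percentages of a total into approximate item counts."""
    return {label: round(total_items * pct / 100) for label, pct in percentages.items()}


# Younger age group: 334 items studied.
younger = approximate_counts(
    334, {"one per cent level": 53, "five per cent level": 10, "not significant": 37}
)

# Older age group: 324 items studied.
older = approximate_counts(
    324, {"one per cent level": 88, "five per cent level": 3, "not significant": 9}
)

print("younger:", younger)
print("older:  ", older)
```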

The summaries of these various chapters 
may be paraphrased to cover the findings 
regarding possible causal factors behind the 
status differences. Position of item responses 
shows inconclusive results as a possible causal 
factor. Symbolism (e.g., verbal items vs. 
geometric design items) shows inconclusive 
results, since it was not controlled for “type of 
question.” Type of question is handled by 
setting up fifty-six logically derived categories 
of items (e.g., synonyms, opposites, analogies, 
etc.). When symbolism and type of question 
are simultaneously studied, the results are 
inconclusive. Level of difficulty of items 
shows inconclusive evidence. With respect 
to age, it is concluded that the higher propor- 
tion of items showing significant status differ- 
ences among older children is due to the 
differences in the nature of the test materials 
rather than to “inherent differences in the 
status characteristics of the pupils at the 
two levels.” This evidence appears conclusive. 

The last two chapters involve “the analysis
. . . based largely upon a subjective process”
of seventy-five items showing differences at 
the one per cent level in one or more of the 
wrong-answer responses and twenty-five items 
showing “unusually large status differences”
in the per cent of the two extreme status 
groups giving right answer responses. Ex- 
planations are “hypothecated” where plausible, 
but “the presence of such a large proportion of 
unexplained differences should, however, lead 
to caution in accepting the idea that all status 
differences on test items can readily be 
accounted for in terms of the cultural bias of 
their content” (page 357). 

Part II, also written by Eells, is listed as a
summary of the field study and need not be
reviewed in detail, except to point out that 
he deals with “common culture” and “own 
culture” (subgroup culture) as if the com- 
munalities and disparities of social groups in 
America are completely documented and 
accessible facts. 

It is manifestly impossible to deny the 
impact on the social sciences of the work of 
the last two decades of those who have revealed 
with brilliant descriptions and imaginative 
insights the class problems in American society. 
Certainly there is new strength and scope in 
our research because of this accumulated 
evidence. But descriptions and insights are
only first steps in research; skillful design and
testable hypotheses are also needed. The 
sampling, theoretical, and design problems 
suggested within the body of this review raise 
serious doubts about the worth of this study. 
Eells’s mentors, it would seem, had a technical
obligation to help him arrive at a better
thesis design. Failing this, they had a moral 
obligation to revise their thinking in the light 
of his inconclusive findings. Failing either 
or both of these desiderata, the monograph 
should have been given a critical and thorough 
editing before publication. 
John G. Darley 


University of Minnesota 











Book Reviews 


Flesch, Rudolf. How to test readability.
New York: Harper and Brothers, 1951. Pp. 
56. $1.00. 

This pocket-sized little manual on read- 
ability is remindful of a Culbertson contract 
bridge digest. It presents techniques, offers 
illustrations of how the techniques work, and 
shows that mere techniques alone are not 
enough. Flesch goes Culbertson somewhat 
better, however, in providing a series of 
questions and answers and in listing an excel- 
lent bibliography. The question and answer 
section is reminiscent of Gallup’s Guide to 
Public Opinion Polls. 

Early pages of the manual reprint most of a 
1948 article from this journal which presented 
Flesch’s revised readability formula. This 
how-to-do-it section is supplemented by eleven 
examples of material ranging from the Bible 
to the Adventures of Huckleberry Finn analyzed 
for their readability scores. In these examples, 
sentence breaks are indicated, personal words 
are bold-faced and quantitative values for 
each formula element are provided. These
paragraphs can be of great usefulness as
standards both for training analysts to use the
formula and for checking accuracy and reli-
ability of trained analysts. Two nomographs
are provided as calculation aids.
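
For readers unfamiliar with the computation, the two scores of the revised formula can be sketched in Python as below. The coefficients are those of Flesch's 1948 formulas as generally cited; they are not quoted in this review. The function names and the sample counts are hypothetical and serve only to show the arithmetic.

```python
def reading_ease(syllables_per_100_words: float, words_per_sentence: float) -> float:
    """Flesch Reading Ease score (1948 revision); higher means easier."""
    return 206.835 - 0.846 * syllables_per_100_words - 1.015 * words_per_sentence


def human_interest(personal_words_per_100_words: float,
                   personal_sentences_per_100_sentences: float) -> float:
    """Flesch Human Interest score (1948 revision); higher means more interesting."""
    return (3.635 * personal_words_per_100_words
            + 0.314 * personal_sentences_per_100_sentences)


# Hypothetical counts for a sample passage (illustrative only).
ease = reading_ease(syllables_per_100_words=150, words_per_sentence=20)
interest = human_interest(personal_words_per_100_words=6,
                          personal_sentences_per_100_sentences=12)

print(f"Reading Ease   = {ease:.1f}")
print(f"Human Interest = {interest:.1f}")
```

The sketch takes the four formula elements as given. Counting syllables, personal words, and personal sentences in a passage is the part that requires a trained analyst.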

In a more qualitative section, Flesch lists a 
number of hints for raising readability. Many 
of these, such as knowing the characteristics 
of one’s audience, and rearranging words, 
sentences, and larger units in a piece of writing 
are independent of his formula and of the 
elements which the formula considers. Other 
hints, such as raising interest by the “you” 
approach, finding simpler words, and breaking 
up sentences and paragraphs, consider writing 
variables included in the formula. Here too, 
examples are provided. 

Among the 44 questions which Flesch asks 
and answers one finds succinct discussions of 
reliability and validity problems, other read- 
ability formulae, the effect on style of using 
short words and sentences, and discussion of 
how the formula applies to advertising, news, 
technical, and legal writing. The answers are
supported by liberal references to the bibli-
ography.

Of particular interest to the reviewer were 
indications that Flesch considers the Human 
Interest portion of his formula more important 
than the Reading Ease portion. In making 
this point, Flesch says that if the reader is 
genuinely interested in what he is reading, 
he may be able to work his way through long 
sentences and difficult words, but primer style 
will not lure a reader to a dull presentation. 
It would seem to the reviewer, however, that 
“genuine” interest is primarily related to 
subject matter, as some recent newspaper 
readership studies have demonstrated. If 
interest is the most important consideration, 
it would seem that a subject-matter content 
analysis approach to the problem of what 
things interest what audience groups would 
yield greater dividends. Of course if subject 
matter can be held constant, the proportions 
of personal words and personal sentences 
assume greater importance. 

All in all, this is a valuable little manual. 
Its modest price and excellent content will 
have wide appeal for all who are concerned 
with improved written communications. 

Robert L. Jones 


Human Resources Research Institute, 
Maxwell Air Base, Alabama 


Travers, Robert M. W. How to make achieve-
ment tests. New York: The Odyssey Press,
1950. Pp. 180. 


This little manual has been written as an 
aid for teachers, in order to help them to de- 
velop objective tests, and to provide them 
with techniques for defining educational goals. 

The introductory chapter indicates the 
modern tendency to attempt more systematic 
and complete evaluation of all the outcomes 
of a course. The need for new-type or ob- 
jective tests to supplement essay-type tests 
in such an evaluation program is recognized. 
A chapter concerned with the making of a 
blueprint for an examination indicates how 
test content can be planned so as to have items
properly allocated in relation to all of the
course objectives. 

Separate chapters provide discussions of 
the advantages and disadvantages of tests 
of the true-false, multiple-choice, and com-
pletion types. These discussions are supple-
mented by detailed directions as to procedure
in constructing such test items, with careful
indication of pitfalls to be avoided. One
chapter provides an elementary description
of procedures in assembling, administering,
and scoring tests, as well as discussion of such
topics as the test-item file, the directions to
pupils, the correction for guessing, and use
of machine methods in scoring. A final
chapter treats such topics as the significance
of test scores, ambiguity in grading, the
validity of achievement tests, and the use of
item analysis. An appendix deals with sug- 
gested methods of scoring free-answer and 
essay-type tests. 

The author has admittedly included much 
material that is based on opinion. He has 
certainly not over-sold the new-type or ob- 
jective test. To the present reviewer he 
appears to be too ready to accept current 
criticism of objective tests, and too ready to
believe in the asserted values of essay-type
tests.


The main use of the book will no doubt be 
in guiding the novice in his first attempts at 
construction of new-type or objective test 
items. The text is extremely elementary. The
sophisticated test worker will be annoyed by 
the treatment which emphasizes test items as 
isolated bits of behavior, neglecting the im- 
plications of test items as samples, signs, 
signals, or symptoms. 

Harold D. Carter 
University of California, 
Berkeley 4, California 


Gulliksen, H. Theory of mental tests. New
York: Wiley, 1950. Pp. xix + 486. $6.00. 

Several good texts in tests and measurements
have appeared in recent years. Some of
these books have included discussions of item
construction and all have given at least an
elementary presentation of the role of statis-
tics in psychological measurement. But for
the most part the recent books have con-
centrated upon describing and evaluating
existing psychological tests. Here, then, is
a book which breaks with the current tradition. 

Gulliksen is not concerned with existing 
tests. (The Stanford-Binet, for example, does 
not appear in the index.) Rather he is in- 
terested in presenting the mathematical and 
statistical bases of test construction. In 
this respect the treatment is something like 
Thurstone’s early (1931) Reliability and Valid-
ity of Tests (upon which Gulliksen has admit-
tedly drawn). But with twenty additional
years of research and advancement in the 
field of mental measurement, Gulliksen can 
and does go beyond Thurstone. 

“The basic theoretical material on accuracy 
of test scores is presented in Chapters 2 
through 5, which deal with the topics of test 
reliability and the error of measurement. 
The effect of test length upon reliability and 
validity is considered in Chapters 6 through 
9, and the effect of group heterogeneity on 
measures of accuracy in Chapters 10 through 
13... . Practical problems of criteria for 
parallel tests are given in Chapter 14, and 
experimental methods of determining reli- 
ability when a parallel form is not used are 
considered in Chapters 15 and 16. Methods 
of scoring, scaling, and equating tests are 
considered in Chapters 18 and 19. Problems 
dealing with batteries of tests are considered 
in Chapter 20, and problems of item selection 
in Chapter 21” (p. 5). 

An appendix contains a table of the normal 
curve, basic equations from mathematics and 
statistics, and sample examinations in statis- 
tics and test theory. 

It is clear that Gulliksen intended his book 
as a text; yet it seems to this reviewer that 
until such time as psychology departments 
see fit to strengthen their requirements in 
mathematics, the book is going to prove to 
be more valuable as a reference than as a text.
The student who has had a good year’s course 
in statistics, including the analysis of variance, 
who knows his algebra, analytic geometry, 
calculus, matrices and determinants, will find 
this an excellent and profitable book to study— 
as indeed it is. But it will not go well with 
students who lack the preparation to com- 
prehend it—even if the instructor should pick 
and choose among the various chapters as 
Gulliksen suggests. 










Regarding the book as a reference work 
rather than as a text, it should be a welcome 
addition to the bookshelf of the professional 
worker in the field of mental tests. From this 
point of view Gulliksen has given us a major 
contribution and one that will be with us for 
a long time to come. 

Allen L. Edwards

The University of Washington


Freeman, G. L., and Taylor, E. K. How to 
pick leaders. New York: Funk & Wagnalls, 
1950. Pp. 222. $3.50. 

Written for “those selecting young men for 
executive training, as well as the aspirants 
themselves,” How to Pick Leaders “attempts 
to distill out of past and current research, the 
common elements of the leadership pattern. 
It then goes on to indicate how such a pattern 
can be employed to improve the search for 
executive talent to eliminate . . . the vagaries 
of unscientific selection practices.”

The book begins by pointing out the
inefficiency and high cost of most present-day
methods of selecting executive trainees.
Next the criterion problem is discussed, with stress
on its importance, and a number of ways of
measuring leadership success are suggested.
Following a section on the building up and
administration of a scientific selection program,
the main portion of the book is devoted to
selection tools and techniques. Included here
are recruiting and screening, interviewing,
aptitude testing, consideration of past performance,
personality measurement, and rating.
The book concludes with an overview of the
total selection program and a section on the
importance of continuing follow-up of those
selected.

The authors have done an especially good job 
of bringing together research findings from a 
number of varied sources. It is refreshing to 
find, in a book written for the lay reader, that 
the authors respect their readers’ intelligence 
enough to include completely footnoted references.
Although highly technical subjects
are discussed, the interesting and down-to-earth
way in which the book is written seems
to ensure that it will be understood by those for
whom it was intended. Many well-chosen
examples and illustrations put life and meaning
into subject matter that might otherwise be
dry or academic sounding for the lay reader.

The reader with a knowledge of the litera- 
ture in this area will find an overly optimistic 
tone to the book which is not warranted on the 
basis of research evidence. For example, the 
relative lack of success to date reported by 
researchers who have been seeking a reliable 
criterion does not justify the statement, “Any 
company has the means at hand for getting 
true objective measurements of relative leader- 
ship success.” Neither does research evidence 
back up the implication in the book that valid 
tools to include in the selection program are 
readily available or fairly easy to construct. 
Unfortunately, because of this optimistic tone, 
the lay reader is likely to picture the process as 
a relatively highly developed one from which 
immediate results can be expected. 

Nevertheless, this book does bring together 
into a single volume almost all of the promising 
tools and techniques for the scientific selection 
of executives. The result is an understandable 
and fairly complete source of information for 
the lay reader on this important area of 
selection in business and industry. 

Theodore R. Lindbom 


Prudential Insurance Company of America, 
Newark, New Jersey 


Vernon, P. The structure of human abilities. 
New York: John Wiley and Sons, Inc., 1951. 
Pp. 160. $2.75. 

For those, like the reviewer, whose knowl- 
edge of factor analysis derives almost entirely 
from books rather than from journal articles, 
this latest book of Professor Vernon’s should 
be invaluable. Many years ago Burt at- 
tempted to reconcile the opposed theories of 
Spearman and Thurstone by suggesting the 
hierarchical theory of human abilities; a theory 
inspired by Spencer. Like many another 
conciliator in many another field, Burt found 
himself attacked by the rival schools. Now 
in an appendix to the present book, Vernon 
gives his reasons for preferring the General plus 
Group Factor, or Hierarchical Theory to the 
Multiple Factor Theories. Unfortunately some 
theories, such as Hotelling’s, which seems to 
be the most satisfactory to the mathematical 
statisticians, are dismissed as not justifying
the extra effort of calculation. Perhaps the
best argument in favour of the Hierarchical 
Theory is that for practical purposes a measure 
of g is essential; and so, whether they like the 
British Theory or not, most American educa- 
tionists, personnel psychologists, and others 
who use psychological tests, do in fact measure 
g, even though they may give it a capital 
letter. 

Vernon has attempted the frightening task
of reviewing “Almost all the contributions
from about 1935 to 1949,” and the more
important works. Much information has
been reworked to conform with the Hierarchical 
Theory. It is not surprising to find that the 
reworked data fit quite well. By the time 
that the reader has finished this book, which 
has been written with great fairness, he will 
probably find that factor analysis has much 
less to offer than many of its exponents would 
have him believe. Apart from g, the only 
reasonably well-established factors are v:ed 
(verbal, educational, numerical), k:m (prac- 
tical, mechanical, spatial), and the X factor, 
which seems to be a complex affair categorizing 
motivation. This X factor has an air of
untouchability about it, which its importance
belies. Reviewers are warned not to criticize
authors adversely for not doing something
which they did not set out to do, and it is
true that Vernon did not “. . . attempt to 
cover studies of personality factors, attitudes 
and interests, or other fields outside abilities.” 
Nevertheless when we find repeated references 
to the importance of this X factor, and 
warnings that it affects test scores to a marked 
degree, we are justified in asking for a fuller 
discussion of this factor. How is it measured?
How can we calculate its effect on scores?
Can it be altered?

For many, the most interesting chapter will 
be that on Occupational Abilities. Evidence 
for broad factors of manual, or finger dexterity 
is lacking, so the use of various tests aimed at 
measuring a non-existent factor should be 
discouraged. The practice, quite common in 
Great Britain, of making a standardized work 
sample for selection purposes is wholly justified.
Again, memory seems to be a collection
of specifics, which do not unite to form a group
factor (is X intruding here too?); wherefore,
then, the countless maze experiments of the
rat men? The separate fields of psychology
are still remote from each other. The German 
psychologists are treated rather roughly, too 
roughly, for, if factor analysis yields results 
differing according to the method used, then 
it is not surprising that the Germans have 
found results which differ from those of factor 
analysis. Some reference to work on produc- 
tive thinking such as that of Duncker or 
Wertheimer would have been welcome. Many 
tests can be treated by transposing the problem 
into different media of thought. We know too 
little of the effect of this. 

The book deserves to be bought (the price 
is very reasonable), and read. The prose is 
readable, although Vernon stoops to such 
horrible words as “stimulatingness.” It should
certainly help clear the air, and give those 
whose work lies more in the applied, than in the 
theoretical field, a clearer view of what factor 
analysis has done. 

Douglas Irvine 


National Institute of Industrial Psychology 
London, England 








New Books, Monographs, and Pamphlets 


Books, monographs, and pamphlets for listing and possible review should be sent to Donald G. Paterson, Editor, 
Department of Psychology, University of Minnesota, Minneapolis 14, Minnesota 


Factor analysis of reasoning tests. Dorothy C. Adkins 
and Samuel B. Lyerly. Chapel Hill: University of 
North Carolina Press, 1952. Pp. 122. $2.00. 

An introduction to projective techniques. Harold H. 
Anderson and Gladys L. Anderson, Editors. New 
York: Prentice-Hall, Inc., 1951. Pp. 720. $6.75. 

Community planning for human services. Bradley Buell 
and associates. New York: Columbia University 
Press, 1952. Pp. 464. $5.50. 

Childhood problems and the teacher. Charlotte Buhler, 
Faith Smitter, and Sybil Richardson. New York: 
Henry Holt and Co., 1952. Pp. 372. $3.75. 

How much do you know about alcohol. Thomas R. 
Carskadon. New York: Association Press, 1951. 
Pp. 31. $.10. 

Psychology in the service of the school. M. F. Cleugh. 
New York: Philosophical Library, 1951. Pp. 183. 
$3.75. 

Changing attitudes through social contact. Leon Fes- 
tinger and Harold H. Kelley. Ann Arbor: Publica- 
tions Department, Institute for Social Research, 
University of Michigan, 1951. Pp. 83. $1.50. 

The art of clear thinking. Rudolf Flesch. New York: 
Harper and Brothers, 1951. Pp. 212. $2.75. 

Fundamentals of social psychology. Eugene L. Hartley 
and Ruth E. Hartley. New York: Alfred A. Knopf, 
Inc., 1952. Pp. 832. $5.50. 

Group treatment in psycho-therapy. Robert G. Hinckley
and Lydia Hermann. Minneapolis: University of
Minnesota Press, 1951. Pp. 136. $3.00.

Speech training. A. Musgrave Horner. New York:
Philosophical Library, 1951. Pp. 176. $3.75.

Human factors in management. Revised edition. Schuy-
ler Dean Hoslett, Editor. New York: Harper and
Brothers, 1951. Pp. 327. $4.00.

Thinking. An introduction to its experimental psychol-
ogy. George Humphrey. New York: John Wiley
and Sons, Inc., 1951. Pp. 331. $4.50.

Cerebral mechanisms in behavior. Lloyd A. Jeffress,
Editor. New York: John Wiley and Sons, Inc., 1951.
Pp. 311. $6.50.

Changing the attitude of Christian toward Jew. Henry
Enoch Kagan. New York: Columbia University
Press, 1951. Pp. 155. $2.75.

The prediction of performance in clinical psychology.
E. Lowell Kelly and Donald W. Fiske. Ann Arbor:
University of Michigan Press, 1951. Pp. 311.

The psychology of adolescent development. Raymond G.
Kuhlen. New York: Harper and Brothers, 1951.
Pp. 642. $5.00.

Sizing up people. Donald A. Laird and Eleanor C.
Laird. New York: McGraw-Hill Book Co., Inc.,
1951. Pp. 270. $3.75.

The retarded child. Herta Loewy. New York: Philo-
sophical Library, 1951. Pp. 160. $3.75.

The psychology of human learning. John A. McGeoch
and Arthur L. Irion. New York: Longmans, Green
and Co., Inc., 1952. Pp. 596. $5.00.

Argument of laughter. D. H. Monro. Melbourne Uni-
versity Press; New York: Cambridge University
Press, 1951. Pp. 264. $3.75.

Readings in personnel administration. Paul Pigors and
Charles A. Myers. New York: McGraw-Hill Book
Co., Inc., 1952. Pp. 483. $4.50.

Social science and psychotherapy for children. Otto
Pollak, et al. New York: Russell Sage Foundation,
1952. Pp. 242. $4.00.

A laboratory manual for social psychology. Wilbert S.
Ray. New York: American Book Co., 1951. Pp.
173. $3.00.

Children who hate. Fritz Redl and David Wineman.
Glencoe, Ill.: The Free Press, 1951. $3.50.

The psychology of adolescence. Alexander A. Schneiders.
Milwaukee: Bruce Publishing Co., 1951. Pp. 550.
$4.00.

Problems of infancy and childhood. Milton J. E. Senn,
Editor. New York: Josiah Macy, Jr. Foundation,
1951. Pp. 181. $2.25.

Symposium on the healthy personality. Milton J. E.
Senn, Editor. New York: Josiah Macy, Jr. Foun-
dation, 1950. Pp. 298. $2.50.

Curriculum development as re-education of the teacher.
George Sharp. New York: Bureau of Publications,
Teachers College, Columbia University, 1951. Pp.
132. $3.50.

Occupational information. Second edition. Carroll L.
Shartle. New York: Prentice-Hall, Inc., 1952. Pp.
448. $5.00.

Diagnosing human relations needs. Hilda Taba, et al.
Washington, D. C.: American Council on Education,
1951. Pp. 155. $1.75.

The study of instinct. N. Tinbergen. New York: Ox-
ford University Press, 1951. Pp. 228. $7.00.

Teaching elementary reading. Miles A. Tinker. New
York: Appleton-Century-Crofts, Inc., 1952. Pp.
366. $3.50.

Personal and social adjustment. Wayland F. Vaughan.
New York: The Odyssey Press, Inc., 1951. Pp. 592.
$4.25.

Student personnel work in college. C. Gilbert Wrenn.
New York: Ronald Press Co., 1951. Pp. 589. $4.75.

Productivity, supervision and morale among railroad
workers. Survey Research Center, University of
Michigan. Ann Arbor: University of Michigan Press,
1951. $1.50.

Selecting supervisors. United States Civil Service Com-
mission. Washington 25, D. C.: Superintendent of
Documents, U. S. Government Printing Office, 1951.
Pp. 30. $.15.

Transactions of the conference on ministry and medicine
in human relationships. New York: New York Acad-
emy of Medicine, 1951. Pp. 75. Available upon
request.





HISTORY OF 
AMERICAN PSYCHOLOGY 
by A. A. Roback 


Here is the first history of American 
Psychology ever to appear, showing 
through development stages how this 
vastly significant aspect of human study
reached its present importance. The vol-
ume presents an over-all picture covering
three centuries, including the numerous 
divisions and activities of the powerful 
American Psychological Association. 

Author of more than twenty books on
[illegible] foreign languages), and as one who stood
close to the chief architects of the science,
Dr. Roback naturally has much first-
hand information. The ever-growing im-
portance of the subject to students, re-
searchers, psychologists, and [illegible]
laymen renders this an invaluable tool for
study, reference, and genuine interest.
Copiously illustrated. $6.00 

Expedite Shipment by prepayment 
LIBRARY PUBLISHERS 
8 West 40th St., New York 18, N. Y. 


BERTRAND RUSSELL’S 
DICTIONARY OF MIND 
MATTER & MORALS 


This exhaustive work offers more than 
1000 definitions and opinions of the 1950 
Nobel Prize winner, arranged as a handy 
key. Here is Russell's challenging thought 
on politics, ethics, philosophy of science, 
epistemology, religion, mathematical phi- 
losophy, and on topics crucial to an under- 
standing of international affairs today. 
Dipped into casually it rewards the 
browser with stimulating and acute intel- 
lectual insights. Read intensively it will 
be found indispensable to a fuller appre- 
ciation of one of the profoundest minds 
of our age. $5.00 


Expedite Shipment by prepayment 


PHILOSOPHICAL LIBRARY 


Publishers 
15 East 40th Street, Desk 186
New York 16, N. Y.








Distinctive McGRAW-HILL Books











PSYCHOLOGY IN INDUSTRY 


By J. Stanley Gray, University of Georgia. McGraw-Hill Publications in Psychology.
401 pages, $5.00 


A basic text containing more factual material than is usually offered in studies of this 
nature. Written in a clear and precise style, the text is informative rather than theoretical 
with all the latest advances included: factors affecting worker efficiency, nutrition, age, 
methods of working, etc. Carefully selected reference readings direct the student to sup- 
plementary study. 


READINGS IN INDUSTRIAL AND BUSINESS PSYCHOLOGY 


By Harry W. Karn and B. von Haller Gilmer, Carnegie Institute of Technology.
McGraw-Hill Series in Psychology. In press 


An outstanding collection of 53 representative articles which point the way toward an 
identification and solution of the more pressing psychological problems in business and 
industry, this book offers an invaluable supplement to an over-all coverage of the field of 
industrial and business psychology. Articles are presented in their original form, and 
each selection is an integrated presentation in itself. 


PSYCHOLOGY IN HUMAN AFFAIRS 


By J. Stanley Gray, University of Georgia. With the Assistance of Eleven Contrib-
utors. McGraw-Hill Publications in Psychology. 646 pages, $5.25 


This book is a factual treatment of applied psychology. Written for all with a limited
understanding of the subject, the text covers the application of psychology in child development,
education, human adjustments, speech correction, public opinion and propaganda,
leisure, etc.


HUMAN RELATIONS IN SUPERVISION 


By Willard E. Parker, Personnel Management Consultants, Chicago; and Robert
W. Kleemeier, Moosehaven Research Laboratory, Orange Park. 472 pages, $5.00


Here is an eminently practical and readable work, designed to show the first line supervisor 
how to deal with the problems of supervision which constantly face him. The authors 
present sound management principles and tested practices and apply modern psychological
theory to problems of training, motivation, and discipline. Emphasis is placed on the
supervisor’s role as a morale builder, and the book discusses practices of hiring,
inducting, evaluating, and counseling the worker.


Send for copies on approval 








McGRAW-HILL BOOK COMPANY, Inc. 


330 West 42nd Street New York 36, N. Y.








